ASR
About
The Automatic Speech Recognition engine is used to convert speech to text. It is an alternative User Interface to DTMF in IVR applications.
Click here to expand Table of Contents
- 1 Introduction
- 2 Nuance
- 2.1 Download
- 2.2 System
- 2.3 Installation
- 2.4 License Manager
- 2.5 Start and Test Speech Server
- 2.6 Logs
- 2.7 mod_unimrcp configuration
- 2.8 Sample Lua IVR
- 2.9 Sample English Grammar
- 2.10 Sample Hindi Grammar
- 3 LumenVox
- 4 Vestec
- 5 Simmortel
- 6 Loquendo
- 7 CMU Sphinx
- 8 HTK
- 9 Kaldi
Introduction
Automated Speech Recognition is currently available via mod_pocketsphinx and mod_unimrcp.
Engines from the following vendors have been evaluated.
Nuance
Nuance is the monopolistic supplier of speech recognition engines. Accuracy and performance are good but cost is prohibitively high for most applications. The SDK is free trial for 2 months. Cost of per port perpetual license varies from $800 - $2000 depending on vocabulary size and natural language. Indian English and Hindi are good. Other Indian languages are available but need to be evaluated for accuracy.
Download
NRec-9.0.19-i386-rhel3.tar.gz NRec-9.0.1-en-IN.i386-rhel3.tar.gz NRec-9.0.1-hi-IN.i386-rhel3.tar.gz NLICMGR-11.7.0-x86_64-linux.tar.gz eval-rec-9.lic
System
CentOS 5.5 x86_64
Installation
tar xvzf NRec-9.0.19-i386-rhel3.tar.gz ./install.sh tar xvzf NRec-9.0.1-en-IN.i386-rhel3.tar.gz tar xvzf NRec-9.0.1-hi-IN.i386-rhel3.tar.gz rpm -ivh NRec-en-IN-9.0-1.i386-rhel3.rpm rpm -ivh NRec-hi-IN-9.0-1.i386-rhel3.rpm
License Manager
yum -y install redhat-lsb tar xvzf NLICMGR-11.7.0-x86_64-linux.tar.gz cd Nuance_License_Manager ./install.sh cd /usr/local/Nuance/license_manager/license cp /root/eval-rec-9.lic . cd ../components ./set-new-lic-file.sh /usr/local/Nuance/license_manager/license/eval-rec-9.lic
Check that the license log file /usr/local/Nuance/license_manager/license/nuance-lic.log has the following contents, which means that the evaluation license file has been correctly configured by the License Manager.
19:55:22 (lmgrd) License file(s): /usr/local/Nuance/license_manager/license/eval-rec-9.lic 19:55:22 (lmgrd) lmgrd tcp-port 27000 19:55:22 (lmgrd) Starting vendor daemons ...
Check that the Nuance License Manager is running.
Nuance License Manager
# ps aux | grep -i nuance
root 4887 0.0 0.0 15912 1244 pts/0 S 19:55 0:00 /usr/local/Nuance/license_manager/components/lmgrd -c /usr/local/Nuance/license_manager/license/eval-rec-9.lic -l /usr/local/Nuance/license_manager/license/nuance-lic.log
root 4888 0.0 0.0 32236 2084 ? Ssl 19:55 0:00 swilmgrd -T localhost.localdomain 11.7 3 -c /usr/local/Nuance/license_manager/license/eval-rec-9.lic --lmgrd_start 4f8988d2
Start and Test Speech Server
service NSSservice start
Check that Nuance client is able to talk to Nuance Speech Server from a different machine.
Logs
Nuance Recognizer logs are in /usr/local/Nuance/Recognizer/data
mod_unimrcp configuration
cat conf/mrcp_profiles/nuance-5.0-mrcp-v2.xml
nuance-5.0-mrcp-v2.xml
<include>
<profile name="nuance5-mrcp2" version="2">
<param name="client-ip" value="$${local_ip_v4}"/>
<param name="client-port" value="5090"/>
<param name="server-ip" value="10.60.20.47"/>
<param name="server-port" value="5060"/>
<param name="sip-transport" value="udp"/>
<param name="ua-name" value="FreeSWITCH"/>
<param name="rtp-ip" value="$${local_ip_v4}"/>
<param name="rtp-port-min" value="4000"/>
<param name="rtp-port-max" value="5000"/>
<param name="rtcp" value="1"/>
<param name="rtcp-bye" value="2"/>
<param name="rtcp-tx-interval" value="5000"/>
<param name="rtcp-rx-resolution" value="1000"/>
<param name="codecs" value="PCMU PCMA L16/96/8000"/>
<synthparams>
</synthparams>
<recogparams>
</recogparams>
</profile>
</include>
cat conf/autoload_configs/unimrcp.conf.xml
unimrcp.conf.xml
<configuration name="unimrcp.conf" description="UniMRCP Client">
<settings>
<param name="default-tts-profile" value="voxeo-prophecy8.0-mrcp1"/>
<param name="default-asr-profile" value="nuance5-mrcp2"/>
<param name="log-level" value="DEBUG"/>
<param name="enable-profile-events" value="false"/>
<param name="max-connection-count" value="100"/>
<param name="offer-new-connection" value="1"/>
</settings>
<profiles>
<X-PRE-PROCESS cmd="include" data="../mrcp_profiles/*.xml"/>
</profiles>
</configuration>
Sample Lua IVR
In dialplan conf/dialplan/default.xml put the following extension
Dialplan Example
<extension name="unimrcp">
<condition field="destination_number" expression="^(.*)4948611$">
<action application="answer"/>
<action application="lua" data="names.lua"/>
</condition>
</extension>
cat scripts/names.lua
Lua ASR Script
session:answer()
--freeswitch.consoleLog("INFO","Called extension is '" .. argv[1] .. "'\n")
welcome= "abhishek/welcome_to_knowlarity.wav"
menu = "abhishek/speak_name.wav"
nohear = "abhishek/sorry_no_hear.wav"
nounderstand = "abhishek/sorry_no_understand.wav"
forward = "abhishek/forwarding_to.wav"
thankyou = "ivr/8000/ivr-thank_you_for_calling.wav"
goodbye = "voicemail/8000/vm-goodbye.wav"
--
grammar = "names"
asrtag = "names"
no_input_timeout = 5000
recognition_timeout = 5000
confidence_threshold = 0.2
--
session:streamFile(welcome)
--freeswitch.consoleLog("INFO","Prompt file is '" .. prompt .. "'\n")
--
tryagain = 1
while (tryagain == 1) do
--
session:execute("play_and_detect_speech",menu .. "detect:unimrcp {start-input-timers=false,no-input-timeout=" .. no_input_timeout .. ",recognition-timeout=" .. recognition_timeout .. "}" .. grammar)
xml = session:getVariable('detect_speech_result')
_,_,pre,result,suf = string.find(xml,"(.*)" .. asrtag .. ":(.*)}(.*)")
_,_,pre,confidence,suf = string.find(xml,"(.*)confidence=\"(.*)\"(.*)")
--
if (result == nil) then
freeswitch.consoleLog("CRIT","Result is 'nil'\n")
freeswitch.consoleLog("CRIT","Confidence is 'nil'\n")
session:streamFile(nohear)
tryagain = 1
elseif (tonumber(confidence) < confidence_threshold) then
freeswitch.consoleLog("CRIT","Result is '" .. result .. "'\n")
freeswitch.consoleLog("CRIT","Confidence is LOW '" .. confidence .. "'\n")
session:streamFile(nounderstand)
tryagain = 1
else
freeswitch.consoleLog("CRIT","Result is '" .. result .. "'\n")
freeswitch.consoleLog("CRIT","Confidence is HIGH '" .. confidence .. "'\n")
prompt = "abhishek/" .. result .. ".wav"
session:streamFile(prompt)
tryagain = 0
end
end
--
session:streamFile(forward)
-- put logic to forward call here
--
session:streamFile(thankyou)
session:sleep(250)
session:streamFile(goodbye)
session:hangup()
Sample English Grammar
cat grammar/sr.gram
English Grammar File
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" xml:lang="en-IN" root="sr" tag-format="swi-semantics/1.0">
<rule id="sr" scope="public">
<one-of>
<item>
<ruleref uri="#sales"/>
<tag>sr='sales'</tag>
</item>
<item>
<ruleref uri="#support"/>
<tag>sr='support'</tag>
</item>
<item>
<ruleref uri="#voicemail"/>
<tag>sr='voicemail'</tag>
</item>
<item>
<ruleref uri="#fax"/>
<tag>sr='fax'</tag>
</item>
</one-of>
</rule>
<rule id="sales">
<one-of>
<item>sales</item>
</one-of>
</rule>
<rule id="support">
<one-of>
<item>support</item>
</one-of>
</rule>
<rule id="voicemail">
<one-of>
<item>voicemail</item>
</one-of>
</rule>
<rule id="fax">
<one-of>
<item>fax</item>
</one-of>
</rule>
</grammar>
Sample Hindi Grammar
cat grammar/test.gram
Hindi Grammar File
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" xml:lang="hi-IN" root="names" tag-format="swi-semantics/1.0">
<rule id="names" scope="public">
<one-of>
<item>
<ruleref uri="#avni"/>
<tag>names='avni'</tag>
</item>
<item>
<ruleref uri="#bhagirath"/>
<tag>names='bhagirath'</tag>
</item>
<item>
<ruleref uri="#abhishek"/>
<tag>names='abhishek singh'</tag>
</item>
</one-of>
</rule>
<rule id="avni">
<one-of>
<item>अवनी</item>
</one-of>
</rule>
<rule id="bhagirath">
<one-of>
<item>भागीरथ</item>
</one-of>
</rule>
<rule id="abhishek">
<one-of>
<item>अभिषेक</item>
<item>अभिषेक सिंह</item>
</one-of>
</rule>
</grammar>
LumenVox
Started off from Sphinx, the free and open source project at CMU but considerable proprietary development has been done for MRCP and acoustic modelling. Indian English and Hindi are available but the cost of SDK is around $4000, so evaluation looks expensive.
Vestec
Vestec provides SDK for $25 or free evaluation. Per port license fee is around $200. Indian English and Hindi are available and to be evaluated.
Simmortel
Simmortel uses a mix of open source and proprietary product to deliver good accuracy for medium vocabulary applications in Indian English and Hindi. However, MRCP is not available and CPU usage for concurrent calls is very high.
Loquendo
Provides good European languages. Acquired by Nuance.
CMU Sphinx
CMU Sphinx is open source and free. It works well if you have a trained acoustic model for your language and application. MRCP integration needs to be done for any real application.
HTK
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.
HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples.
HTK was originally developed at the Machine Intelligence Laboratory (formerly known as the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED) where it has been used to build CUED's large vocabulary speech recognition systems (see CUED HTK LVR). In 1993 Entropic Research Laboratory Inc. acquired the rights to sell HTK and the development of HTK was fully transferred to Entropic in 1995 when the Entropic Cambridge Research Laboratory Ltd was established. HTK was sold by Entropic until 1999 when Microsoft bought Entropic. Microsoft has now licensed HTK back to CUED and is providing support so that CUED can redistribute HTK and provide development support via the HTK3 web site. See History of HTK for more details.
While Microsoft retains the copyright to the original HTK code, everybody is encouraged to make changes to the source code and contribute them for inclusion in HTK3.
Kaldi
Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.