Asterisk - Voicemail with Speech Recognition using Google API

In a previous article I published a solution to convert Asterisk voicemail attachments from WAV to MP3 on the fly. This is done by catching the mails sent by Asterisk just before they are passed to sendmail.

Daniel Dainty recently gave me the idea of adding a voice recognition feature at the same time as the mp3 encoding.

After testing different voice recognition engines, I found the Google Speech Recognition API to be far superior to the other solutions available under Linux (Sphinx, ...).

This article will explain an approach to add voice recognition to Asterisk voicemail using the services of Google Speech Recognition API.

The principle is very simple. After doing the voicemail mp3 conversion, the script :

  1. does some pre-processing clean-up on the file,
  2. converts it to an acceptable format (flac),
  3. sends it to Google speech recognition engine,
  4. gets back the text version
  5. adds it at the end of the mail body.

This procedure was carried out on a Debian Squeeze server. It should be fully compatible with an Ubuntu server, where you will just need to add sudo to the commands needing root privileges.

This article is just a Proof Of Concept.
The Google Voice Recognition API came with the speech recognition features of Android phones and Chrome V11.
As this API is still not officially public, you should not use it in any way in a production environment.
That being said, let's go ahead.

1. Codecs and Tools

The Google Voice Recognition API accepts two audio formats :

  • flac
  • speex (special format with header byte)

After trying both, I found voice recognition to be far more accurate with flac encoded files.

Even though a flac file is 3 to 4 times bigger than its speex equivalent, it weighs only 15 kB per second, and the gain in accuracy is well worth it.

1.1. Install Tools

The first step is to install the tools used to encode audio files and send them to the Google API :

Terminal
# apt-get install dos2unix lame sox curl flac
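
Before going further, it can be worth checking that every tool is actually on the PATH. The following is a minimal sketch: `missing_tools` is a hypothetical helper name, and the list also includes base64 and unix2dos because the voicemail script at the end of this article calls them.

```shell
#!/bin/sh
# Hypothetical helper: print each tool from the argument list that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools installed above, plus base64 and unix2dos which the voicemail script also calls
MISSING=$(missing_tools dos2unix unix2dos lame sox curl flac base64)

if [ -n "$MISSING" ]; then
  echo "Missing tools :" $MISSING
else
  echo "All tools are available."
fi
```

If a name is reported as missing, install the package that provides it before running the tests of the next section.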

1.2. Compile & Install SPEEX codec

If you plan to use flac files, you don't need to run the following commands. But in case you really want to use the speex codec (you have been warned that overall speech recognition quality is much worse), you'll need to compile a specific version of speex.

As the Google API expects a slightly modified speex format which includes a specific header byte, a compatible speex version can be compiled from a Git repository :

Terminal
# apt-get install build-essential autoconf libogg-dev bison git
# git clone git://github.com/QXIP/Speex-with-header-bytes.git
# cd Speex-with-header-bytes
# ./configure
# make
# make install

speexenc is now available under /usr/local/bin

2. Test Google Speech Recognition API

You first need to record a WAV file (any setting will do).

Try to record it with a good microphone and avoid any built-in laptop microphone, which picks up too much background noise.

Let's assume that your file is called english.wav and that it contains the following sentence :

"I hope you will recognize this message without any problem."

Then, whatever audio format is used, the original recording has to go through some pre-processing to :

  1. remove silence at the beginning and at the end using voice activity detection
  2. apply a low-pass filter with cut-off frequency of 2.5 kHz

This is needed to minimize the file size and to maximize recognition accuracy.
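
The sox vad effect can only trim silence from the start of a recording, which is why the commands in this article chain it with reverse to clean both ends. The sketch below only prints the command with each option commented. `preprocess_cmd` is a hypothetical helper name, and nothing is executed, so sox is not needed to try it.

```shell
#!/bin/sh
# Hypothetical helper: print the sox command that pre-processes a recording.
# Nothing is executed here, so sox does not need to be installed to try it.
preprocess_cmd() {
  input=$1
  output=$2
  # -r 16000 -b 16 -c 1 : resample to 16 kHz, 16 bits, mono
  # vad                 : trim silence before the first detected speech
  # reverse             : flip the audio so that the end becomes the start
  # vad                 : trim what was originally the trailing silence
  # reverse             : flip the audio back to its normal direction
  # lowpass -2 2500     : two-pole low-pass filter with a 2.5 kHz cut-off
  echo "sox $input -r 16000 -b 16 -c 1 $output vad reverse vad reverse lowpass -2 2500"
}

CMD=$(preprocess_cmd english.wav audio.flac)
echo "$CMD"
```

The printed line can be pasted into a terminal once sox is installed.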

2.1. Using FLAC format

When using the flac audio format, you need to :

  1. apply pre-processing & convert file to mono 16kHz flac
  2. send the file to Google API, specifying the language

In return, you'll get a JSON stream with an utterance key giving you the text transcription and a confidence key giving the confidence level.

Terminal
# sox english.wav -r 16000 -b 16 -c 1 audio.flac vad reverse vad reverse lowpass -2 2500
# curl --data-binary @audio.flac --header 'Content-type: audio/x-flac; rate=16000' 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=0&maxresults=1&lang=en-US'
{"status":0,"id":"67ea600c89e608344d47b1083300d427-1","hypotheses":[{"utterance":"I hope you will recognize this message without any problem","confidence":0.85149354}]}

The result is the exact transcription of the original message, with a confidence factor of 85% !
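
To reuse this answer in a script, the two fields have to be pulled out of the JSON line. Here is a minimal sketch using only sed and awk. The function names are hypothetical, and it assumes the answer is a single line shaped like the one above. It matches up to the closing double quote rather than splitting on commas, so a transcription containing a comma stays whole.

```shell
#!/bin/sh
# Hypothetical helpers: pull fields out of the one-line JSON answer with sed.

# Text between "utterance":" and the next double quote
get_utterance() {
  printf '%s\n' "$1" | sed -n 's/.*"utterance":"\([^"]*\)".*/\1/p'
}

# Confidence as a whole percentage, rounded to the nearest integer
get_confidence_percent() {
  printf '%s\n' "$1" \
    | sed -n 's/.*"confidence":\([0-9.]*\).*/\1/p' \
    | awk '{ printf "%d\n", $1 * 100 + 0.5 }'
}

# Answer copied from the flac test above
RESPONSE='{"status":0,"id":"67ea600c89e608344d47b1083300d427-1","hypotheses":[{"utterance":"I hope you will recognize this message without any problem","confidence":0.85149354}]}'

echo "Text       : $(get_utterance "$RESPONSE")"
echo "Confidence : $(get_confidence_percent "$RESPONSE")%"
```

This is not a real JSON parser: a transcription containing an escaped double quote would be cut short, so a dedicated JSON tool is a safer choice if one is available.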

2.2. Using SPEEX format

When using the speex audio format, the steps are almost the same :

  1. apply pre-processing & convert file to mono 16kHz wav
  2. encode to speex format including --headerbyte option
  3. send the file to Google API, specifying the language

In return, you'll get a JSON stream with the same structure.

Terminal
# sox english.wav -t wav -r 16000 -b 16 -c 1 audio.wav vad reverse vad reverse lowpass -2 2500
# speexenc -w --headerbyte audio.wav audio.spx
Encoding 16000 Hz audio using wideband (sub-band CELP) mode (mono)
Warning: with-header-byte output will not be compatible with most decoders.
# curl --data-binary @audio.spx --header 'Content-type: audio/x-speex-with-header-byte; rate=16000' 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=1&maxresults=1&lang=en-US'
{"status":0,"id":"0389a123bf0c42cbbe9bc25ae4377fe0-1","hypotheses":[{"utterance":"find a boat","confidence":0.49228987}]}

As you can see, for the same input file, the result is really bad and the confidence level is below 50% ...

There is only one conclusion : just avoid the speex format with the Google Speech Recognition API.

3. Asterisk Setup

Now that we have validated that Google voice recognition works very well with the flac format, let's set up our Asterisk system to use it for voicemail.

3.1. Pre-requisite

Before going further, you should follow the procedure Asterisk - Setup voicemail to send email with mp3 attachment.

The following Asterisk voicemail script implementation is based on this procedure.

It only modifies the sendmailmp3 script, which is assumed to be fully working already.

3.2. Adapt sendmailmp3

Since the sendmailmp3 script is already set up and running, we will modify it to include the call to the voice recognition engine.

This script is located under /usr/sbin.

/usr/sbin/sendmailmp3
#!/bin/bash
# Asterisk voicemail attachment conversion script, including voice recognition 
# Use Voice Recognition Engine provided by Google API
#
# Revision history :
# 22/11/2010 - V1.0 - Creation by N. Bernaerts
# 07/02/2012 - V1.1 - Add handling of mails without attachment (thanks to Paul Thompson)
# 01/05/2012 - V1.2 - Use mktemp, pushd & popd
# 08/05/2012 - V1.3 - Change mp3 compression to CBR to solve some smartphone compatibility (thanks to Luca Mancino)
# 01/08/2012 - V1.4 - Add PATH definition to avoid any problem (thanks to Christopher Wolff)
# 31/01/2013 - V2.0 - Add Google Voice Recognition feature (thanks to Daniel Dainty idea and sponsoring :-)
# 04/02/2013 - V2.1 - Handle error in case of voicemail too long to be converted
# 16/07/2015 - V2.2 - Handle natively GSM WAV (thanks to Michael Munger)

# set language for voice recognition (en-US, en-GB, fr-FR, ...)
LANGUAGE="en-US"

# set PATH
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# save the current directory
pushd .
 
# create a temporary directory and cd to it
TMPDIR=$(mktemp -d)
cd $TMPDIR
 
# dump the stream to a temporary file
cat >> stream.org
 
# get the boundary
BOUNDARY=$(grep "boundary=" stream.org | cut -d'"' -f 2)
 
# cut the file into parts
# stream.part - header before the boundary
# stream.part1 - header after the boundary
# stream.part2 - body of the message
# stream.part3 - attachment in base64 (WAV file)
# stream.part4 - footer of the message
awk '/'$BOUNDARY'/{i++}{print > "stream.part"i}' stream.org
 
# if mail is having no audio attachment (plain text)
PLAINTEXT=$(cat stream.part1 | grep 'plain')
if [ "$PLAINTEXT" != "" ]
then
 
  # prepare to send the original stream
  cat stream.org > stream.new
 
# else, if mail is having audio attachment
else
 
  # cut the attachment into parts
  # stream.part3.wav.head - header of attachment
  # stream.part3.wav.base64 - wav file of attachment (encoded base64)
  sed '7,$d' stream.part3 > stream.part3.wav.head
  sed '1,6d' stream.part3 > stream.part3.wav.base64
 
  # convert the base64 file to a wav file
  dos2unix -o stream.part3.wav.base64
  base64 -di stream.part3.wav.base64 > stream.part3.wav
 
  # convert wav file to mp3 file
  # -b 24 uses CBR, giving better compatibility on smartphones (you can use -b 32 to increase quality)
  # alternative : -V 2 (instead of -b 24) uses VBR, a good compromise between quality and size for voice audio files
  lame -m m -b 24 stream.part3.wav stream.part3.mp3
 
  # convert back mp3 to base64 file
  base64 stream.part3.mp3 > stream.part3.mp3.base64
 
  # generate the new mp3 attachment header
  # change Type: audio/x-wav or audio/x-WAV to Type: audio/mpeg
  # change name="msg----.wav" or name="msg----.WAV" to name="msg----.mp3"
  sed 's/x-[wW][aA][vV]/mpeg/g' stream.part3.wav.head | sed 's/\.[wW][aA][vV]/.mp3/g' > stream.part3.mp3.head
 
  # convert wav file to flac compatible for Google speech recognition
  sox stream.part3.wav -r 16000 -b 16 -c 1 audio.flac vad reverse vad reverse lowpass -2 2500

  # call Google Voice Recognition sending flac file as POST
  curl --data-binary @audio.flac --header 'Content-type: audio/x-flac; rate=16000' 'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=0&lang='$LANGUAGE'&maxresults=1' 1>audio.txt

  # extract the transcript and confidence results
  FILETOOBIG=$(cat audio.txt | grep "<HTML>")
  TRANSCRIPT=$(cat audio.txt | cut -d"," -f3 | sed 's/^.*utterance\":\"\(.*\)\"$/\1/g')
  CONFIDENCE=$(cat audio.txt | cut -d"," -f4 | sed 's/^.*confidence\":0.\([0-9][0-9]\).*$/\1/g')

  # generate first part of mail body, converting it to LF only
  mv stream.part stream.new
  cat stream.part1 >> stream.new
  sed '$d' < stream.part2 >> stream.new

  # beginning of transcription section
  echo "---" >> stream.new

  # if audio attachment is too big
  if [ "$FILETOOBIG" != "" ]
  then
    # error message
    echo "Voice message is too long to be transcribed." >> stream.new
  else
    # append result of transcription
    echo "Message seems to be ( $CONFIDENCE% confidence ) :" >> stream.new
    echo "$TRANSCRIPT" >> stream.new
  fi

  # end of message body
  tail -1 stream.part2 >> stream.new

  # append mp3 header
  cat stream.part3.mp3.head >> stream.new
  dos2unix -o stream.new

  # append base64 mp3 to mail body, keeping CRLF
  unix2dos -o stream.part3.mp3.base64
  cat stream.part3.mp3.base64 >> stream.new
 
  # append end of mail body, converting it to LF only
  echo "" >> stream.tmp
  echo "" >> stream.tmp
  cat stream.part4 >> stream.tmp
  dos2unix -o stream.tmp
  cat stream.tmp >> stream.new
 
fi
 
# send the mail thru sendmail
cat stream.new | sendmail -t
 
# go back to original directory
popd
 
# remove all temporary files and temporary directory
rm -Rf $TMPDIR

Everything is now ready.
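
If you want to see how the script cuts the mail into stream.part files, the boundary step can be tried on its own. The sketch below uses a tiny made-up multipart message, not a real Asterisk mail, and works in a scratch directory. The awk output file name is put in parentheses here, which some awk implementations require.

```shell
#!/bin/sh
# Work in a scratch directory so that the demo leaves nothing behind
WORKDIR=$(mktemp -d)
cd "$WORKDIR" || exit 1

# A tiny made-up multipart message (assumption: not a real Asterisk mail)
cat > stream.org <<'EOF'
To: user@example.com
Subject: [PBX]: demo
Content-Type: multipart/mixed; boundary="----voicemail_demo123"

preamble
------voicemail_demo123
Content-Type: text/plain

New message in mailbox
------voicemail_demo123
Content-Type: audio/x-wav; name="msg0001.wav"

UklGRg==
------voicemail_demo123--
EOF

# Same two steps as the script: read the boundary, then start a new file
# each time a line contains it (the header line declaring it included)
BOUNDARY=$(grep "boundary=" stream.org | cut -d'"' -f 2)
awk '/'"$BOUNDARY"'/{i++}{print > ("stream.part" i)}' stream.org

# Keep what we want to show, then clean up
PART_COUNT=$(ls stream.part* | wc -l | tr -d ' ')
ATTACHMENT_HEADER=$(sed -n '2p' stream.part3)
echo "$PART_COUNT parts, attachment header : $ATTACHMENT_HEADER"

cd / && rm -rf "$WORKDIR"
```

The line declaring the boundary in the mail header also matches the pattern, which is why the headers end up spread over stream.part and stream.part1.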

3.3. Result

Your next voice mail should look like this :

Email
[PBX]: Voicemail from ...

New message, 0:04 long in mailbox ..... on .....
---
Message seems to be ( 85% confidence ) :
I hope you will recognize this message without any problem
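
The script only treats one failure case, an HTML page returned for a voicemail that is too long. The sketch below classifies an answer a little further. `response_kind` is a hypothetical helper name, the HTML sample is made up, and treating an answer without any utterance as "nothing recognized" is an assumption, not documented API behaviour.

```shell
#!/bin/sh
# Hypothetical helper: classify the raw answer received from the API.
response_kind() {
  case $1 in
    *"<HTML>"*)        echo "too_long" ;;  # same check as the script : an HTML page instead of JSON
    *'"utterance":"'*) echo "ok" ;;        # at least one hypothesis came back
    *)                 echo "empty" ;;     # assumption : nothing was recognized
  esac
}

# Three sample answers : the one from the flac test, a made-up HTML page, an empty answer
for ANSWER in \
  '{"status":0,"id":"67ea600c89e608344d47b1083300d427-1","hypotheses":[{"utterance":"I hope you will recognize this message without any problem","confidence":0.85149354}]}' \
  '<HTML><HEAD><TITLE>Error</TITLE></HEAD></HTML>' \
  ''
do
  case $(response_kind "$ANSWER") in
    ok)       echo "Transcription available." ;;
    too_long) echo "Voice message is too long to be transcribed." ;;
    empty)    echo "Voice message could not be transcribed." ;;
  esac
done
```

A check of this kind avoids appending a raw JSON fragment to the mail when no transcription comes back.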

 

Hope it helps.

This article is published "as is", without any warranty that it will work for your specific need.
If you think this article needs some complement, or simply if you think it saved you lots of time & trouble,
just let me know. Cheers !
