LumenVox has created an easy-to-use, flexible Speech Recognition API. This document guides a developer through implementing an application against this API.
Opening the SpeechPort
To use the Speech Recognition Engine (Speech Engine), you first need to open a SpeechPort. This creates the link to the Speech Engine, the grammar collection and the voice channel collection. To open the SpeechPort, call the function LV_SRE_OpenPort.
HPORT LV_SRE_OpenPort (ExportLogMsg log,
void *p,
int verbosity);
Return Values:
Not NULL – This is a valid handle to the newly created SpeechPort.
NULL – This operation has failed. In most cases this is because a license limit has been reached.
Parameters:
log – A pointer to a function that will receive SpeechPort-specific messages.
p – A pointer to user-created data.
verbosity – A number between 0 and 6 describing the level of messaging:
0 – minimal logging.
6 – maximum logging.
ExportLogMsg is a type definition of a function pointer. It is the model that your logging function's signature must follow. The type definition:
typedef void (*ExportLogMsg)(const char* String,
void* p);
Parameters:
String – The message from the SpeechPort.
p – The pointer to the user-created data that was passed into LV_SRE_OpenPort.
This is a callback function: whenever appropriate, the SpeechPort will call it to deliver messages while the SpeechPort is still running.
Opening a SpeechPort:
#include <windows.h>
#include <stdio.h>
#include <LV_SRE.h>
class CMyData //this is an arbitrary class; it could contain any
              //information desired.
{
public:
    CMyData(){m_nSomeNumber = 0; m_nLogType = 1;}
    INT m_nSomeNumber;
    INT m_nLogType;
};
void SPLogging(const char* String, void* p)
{
    CMyData* pMD = (CMyData *) p;
    if(pMD->m_nLogType == 1) //example use of userdata
    {
        //print through "%s" so the message is never used as a format string
        printf("%s", String);
    }
    else
    {
        //display the msg to a GUI.
    }
}
HPORT OpenTheSpeechport(void)
{
    HPORT hPort;
    CMyData* pMD = new CMyData; //must stay alive while the port is open
    hPort = LV_SRE_OpenPort(SPLogging, pMD, 3);
    return hPort;
}
SPLogging matches the type definition for ExportLogMsg. CMyData is an example of what can be passed in to LV_SRE_OpenPort’s void pointer parameter. hPort is the handle to the SpeechPort that was created, and it must be used for all future calls to this SpeechPort. In the example it is simply returned to the caller; an application could persist it in many different ways, for example in a document object such as MyAppDoc.
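The callback mechanism can be tried in isolation, without the LumenVox library. The following is a minimal sketch under stated assumptions: ExportLogMsg is copied from the type definition above, while LogStats, CountingLogger and FakeEngineWork are hypothetical names invented for this illustration and are not part of the LumenVox API.

```cpp
#include <cassert>
#include <cstdio>

// Same shape as the ExportLogMsg type definition described above.
typedef void (*ExportLogMsg)(const char* String, void* p);

// Hypothetical user data: counts the messages that were delivered.
struct LogStats
{
    int nMessages;
};

// A logging function whose signature matches ExportLogMsg.
void CountingLogger(const char* String, void* p)
{
    LogStats* pStats = static_cast<LogStats*>(p);
    pStats->nMessages++;
    std::printf("log: %s\n", String);
}

// Hypothetical stand-in for a library routine. It knows nothing about
// LogStats; it only hands the opaque pointer back to the callback.
void FakeEngineWork(ExportLogMsg log, void* p)
{
    assert(log != NULL);
    log("port opened", p);
    log("grammar loaded", p);
}
```

Calling FakeEngineWork(CountingLogger, &stats) with a LogStats variable delivers two messages to the logger. The point of the void pointer is that the caller decides what travels with each message, and the library never needs to know its type.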
Closing a SpeechPort
When the SpeechPort is no longer needed it should be destroyed. This is done using the function LV_SRE_ClosePort.
int LV_SRE_ClosePort(HPORT hport);
Return Values
LV_SUCCESS - The port has been successfully shut down.
LV_FAILURE - The port was unable to shut down.
Parameters
hport – This must be a valid HPORT returned by LV_SRE_OpenPort.
#include <LV_SRE.h>
void CloseTheSpeechPort(HPORT hPort)
{
    int nRetVal;
    nRetVal = LV_SRE_ClosePort(hPort);
    if (nRetVal != LV_SUCCESS)
    {
        //Report error
    }
}
Working with the Grammar
The SpeechPort can maintain up to 64 grammars. Each grammar is defined dynamically at runtime. Choose which grammar to use by specifying a number between 0 and 63. A Grammar is made up of concepts, which in turn are made up of phrases. Each concept is a collection of phrases that denote the same thing. If ‘yes’ is the concept, then ‘yes’, ‘yes please’ and ‘yup’ could all be phrases for ‘yes’.
INT LV_SRE_AddPhrase (HPORT hport,int GrammarSet,
const char* Concept,
const char* Phrase);
Return Values
LV_SUCCESS - The command completed successfully.
LV_FAILURE - The port has not been opened.
LV_SYSTEM_ERROR - The speech recognition engine is no longer running. This is the result of an LV_SRE_ClosePort call or an unrecoverable engine error.
Parameters:
hport – This is the HPORT returned by LV_SRE_OpenPort.
GrammarSet – The Grammar’s number, 0 - 63. The concept/phrase will be added to the Grammar specified.
Concept – This is a new or existing concept.
Phrase – The phrase to add to the concept.
The same method is used whether a phrase is added to a new concept or to an existing one. The first time a new concept is specified in LV_SRE_AddPhrase, it is added to the Grammar. This means each concept must have at least one phrase, because it is the phrase, not the concept, that is evaluated during the Decode. This is a slight simplification, since concept grouping affects confidence scoring.
#include "LV_SRE.h"
void SPAddYesNoConcepts(HPORT hport)
{
    LV_SRE_AddPhrase(hport, 1, "no", "no");
    LV_SRE_AddPhrase(hport, 1, "no", "nope");
    LV_SRE_AddPhrase(hport, 1, "yes", "(yes [please]) | correct");
}
This function has set up Grammar 1 with a simple yes/no grammar. When decoded against Audio Data, it will look for ‘no’, ‘nope’, ‘yes’, ‘yes please’ and ‘correct’. If it finds ‘no’ or ‘nope’, the concept ‘no’ is identified. The concept ‘yes’ will be selected if any of the others are located. If the speaker uttered both yes and no, the decode may identify both and add them both to the list of concepts returned.
Phrases can be added in several ways; two were demonstrated above. The ‘no’ phrases were added one phrase at a time. The ‘yes’ phrases were added using a BNF construct. The single entry ‘(yes [please]) | correct’ equals three different phrases. The simple rules for the BNF:
[] = optional word
| = or operator.
()= logical grouping.
Therefore (yes [please]) | correct = yes, yes please, correct
Phrases can also be added with a Phoneme block. Phonemes are used to tell the recognition engine the actual pronunciation of a word. Sets of phonemes put together make a phrase, for example Yup = Y AH P. You can label a phoneme block with a real word or caption by using a colon. See the phoneme list in the appendix.
{} = phoneme block
: = label separator
{Y AH P : yup}
You can mix the three different methods as you wish.
(yes [please]) | correct | {Y AH P} | { Y AE : yeah }
This BNF has five phrases in it: yes, yes please, correct, yup and yeah.
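To make the expansion rules concrete, the following is a minimal sketch of a recursive-descent expander that lists the phrases a BNF entry stands for. It is an illustration only and does not claim to be how the Speech Engine parses phrases. BnfExpander, Phrases, Concat and Trim are hypothetical names. The sketch assumes that words are separated by single spaces, that a labelled phoneme block is listed by its label, and that an unlabelled block is listed by its phonemes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Phrases;

// Sequence: every phrase in a followed by every phrase in b.
static Phrases Concat(const Phrases& a, const Phrases& b)
{
    Phrases out;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
        {
            if (a[i].empty())      out.push_back(b[j]);
            else if (b[j].empty()) out.push_back(a[i]);
            else                   out.push_back(a[i] + " " + b[j]);
        }
    return out;
}

static std::string Trim(const std::string& s)
{
    std::size_t b = s.find_first_not_of(' ');
    if (b == std::string::npos) return std::string();
    return s.substr(b, s.find_last_not_of(' ') - b + 1);
}

class BnfExpander
{
public:
    explicit BnfExpander(const std::string& text) : m_text(text), m_pos(0) {}
    Phrases Expand() { return ParseAlternatives(); }

private:
    std::string m_text;
    std::size_t m_pos;

    bool AtEnd() const { return m_pos >= m_text.size(); }
    void SkipSpaces() { while (!AtEnd() && m_text[m_pos] == ' ') ++m_pos; }

    // alternatives := sequence ('|' sequence)*
    Phrases ParseAlternatives()
    {
        Phrases out = ParseSequence();
        while (!AtEnd() && m_text[m_pos] == '|')
        {
            ++m_pos;
            Phrases more = ParseSequence();
            out.insert(out.end(), more.begin(), more.end());
        }
        return out;
    }

    // sequence := item*, ending at '|', ')', ']' or end of input
    Phrases ParseSequence()
    {
        Phrases out(1, std::string());
        for (;;)
        {
            SkipSpaces();
            if (AtEnd()) break;
            char c = m_text[m_pos];
            if (c == '|' || c == ')' || c == ']') break;
            out = Concat(out, ParseItem());
        }
        return out;
    }

    // item := word | '(' alternatives ')' | '[' alternatives ']' | '{' block '}'
    Phrases ParseItem()
    {
        assert(!AtEnd());
        char c = m_text[m_pos];
        if (c == '(' || c == '[')
        {
            ++m_pos;
            Phrases inner = ParseAlternatives();
            if (!AtEnd()) ++m_pos;  // skip the closing ')' or ']'
            if (c == '[') inner.insert(inner.begin(), std::string());  // optional
            return inner;
        }
        if (c == '{')
        {
            std::size_t end = m_text.find('}', m_pos);
            if (end == std::string::npos) end = m_text.size();
            std::string block = m_text.substr(m_pos + 1, end - m_pos - 1);
            m_pos = end < m_text.size() ? end + 1 : end;
            std::size_t colon = block.find(':');
            if (colon != std::string::npos) block = block.substr(colon + 1);
            return Phrases(1, Trim(block));
        }
        std::size_t start = m_pos++;  // plain word: always consume one character
        while (!AtEnd() &&
               std::string(" |()[]{}").find(m_text[m_pos]) == std::string::npos)
            ++m_pos;
        return Phrases(1, m_text.substr(start, m_pos - start));
    }
};
```

For the entry ‘(yes [please]) | correct’, BnfExpander("(yes [please]) | correct").Expand() yields the three phrases listed earlier, in the order yes, yes please, correct. A listing like this can be a quick way to check how many phrases a single BNF entry adds to a grammar.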
At times the application may need to remove a concept from a grammar. This may happen when the wrong concept is picked, which can be caused by two concepts having similar phrases. Suppose, for example, that John Brown and John Crown are phrases in two different concepts. The speaker utters John Brown, but the system picks John Crown. The application confirms this with the speaker, and the speaker says no to John Crown. The concept can then be removed and the old audio decoded again with the modified grammar. Now John Brown is picked, because John Crown is no longer in the grammar.
int LV_SRE_RemoveConcept(HPORT hport,int GrammarSet,
const char* Concept);
Return Values
LV_SUCCESS - The concept was successfully deleted
LV_FAILURE - The Grammar Set specified is outside the valid range.
LV_SYSTEM_ERROR - The speech recognition engine is no longer running. This is the result of an LV_SRE_ClosePort call or an unrecoverable engine error.
Parameters:
hport - This is the HPORT returned by LV_SRE_OpenPort.
GrammarSet – The Grammar’s number, 0 - 63. The concept and its phrases will be removed from the Grammar specified.
Concept – This is an existing concept to remove.
To reuse a grammar, LV_SRE_ResetGrammar can be called. This removes all concepts and phrases, leaving the grammar ready to be filled out with a new set of concepts and phrases.
int LV_SRE_ResetGrammar(HPORT hport,int GrammarSet);
Return Values
LV_SUCCESS - Grammar reset.
LV_GRAMMAR_SET_OUT_OF_RANGE - GrammarSet value invalid.
Parameters:
hport - This is the HPORT returned by LV_SRE_OpenPort.
GrammarSet – The Grammar’s number, 0 - 63. All concepts and phrases will be removed from the Grammar specified.
There are several predefined (standard) grammars available. When one of these is loaded into a Grammar, any user-added concepts and phrases in that Grammar are ignored. For example, when the built-in digits grammar is specified, the only phrases evaluated against the audio are the built-in digits phrases. This is true even if the application has added concepts and phrases to the Grammar. Once a Standard Grammar is loaded, LV_SRE_ResetGrammar must be called to remove it.
int LV_SRE_LoadStandardGrammar(HPORT hport,int GrammarSet,
int DefaultGrammar);
Return Values
LV_SUCCESS -- The command completed successfully.
LV_STANDARD_GRAMMAR_ALREADY_LOADED - Only one standard grammar can be loaded for a grammar set.
LV_STANDARD_GRAMMAR_OUT_OF_RANGE - The standard grammar value is unknown.
Parameters
hport - This is the HPORT returned by LV_SRE_OpenPort.
GrammarSet – The Grammar’s number, 0 - 63. The standard grammar will be loaded into the Grammar specified.
DefaultGrammar - The standard grammar to load. The standard grammars are:
GRAMMAR_DIGITS - String of single digits, [0 – 9]
GRAMMAR_MONEY - Monetary value (Dollars and cents)
GRAMMAR_NUMBER - Numeric value like 12,000 ‘twelve thousand’, 24.45 ‘twenty-four point forty-five’, or 35 ‘thirty-five’.
GRAMMAR_LETTERS - Alphabetic letters (A-Z).
GRAMMAR_DATE - Date and time.
Adding the Audio Data
Once the grammar is defined, the Audio Data needs to be set. The LumenVox Speech Engine is hardware independent, which allows for greater flexibility when collecting the data. The Audio Data needs to be in one of three formats: u-law 8-bit, PCM 8-bit or PCM 16-bit. The audio must be in raw audio format, meaning that no header information can be present. WAV files have a header; raw files do not. The audio is stored in a voice channel. Each SpeechPort has 64 different voice channels, which allows 64 different audio data samples to be stored in a SpeechPort at once. Most applications should only need two: one for the main question and one for a yes/no confirmation question.
int LV_SRE_LoadVoiceChannel(HPORT hport,int VoiceChannel,
void* M,int Length,
SOUND_FORMAT Format);
Return Values
LV_SUCCESS - Voice channel audio successfully loaded.
LV_SYSTEM_ERROR - The speech recognition engine is no longer running. This is the result of an LV_SRE_ClosePort call or an unrecoverable engine error.
LV_FAILURE - Sound format was incorrectly specified or the sound file specified could not be opened.
Parameters
hport - This is the HPORT returned by LV_SRE_OpenPort.
VoiceChannel – The Voice Channel to load. Accepted values are 0 through 63.
M - Void pointer to audio data.
Length - The size in bytes of the audio data.
Format - The sound format of the audio data.
ULAW_8KHZ,
PCM_8KHZ,
PCM_16KHZ
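Because the audio must be headerless, the sample data of a WAV file has to be separated from its header before it is loaded into a voice channel. The following is a minimal sketch of one way to locate that data. It assumes the common RIFF/WAVE layout: a 12-byte RIFF header followed by chunks, each with a 4-byte identifier and a 4-byte little-endian size. RawAudio, ReadLE32 and FindWavData are hypothetical names, not part of the LumenVox API, and the sketch does not verify that the samples are in one of the supported formats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct RawAudio
{
    const unsigned char* pData;  // first byte of sample data, or NULL
    std::size_t nLength;         // size of the sample data in bytes
};

// Reads a 32-bit little-endian value regardless of host byte order.
static unsigned long ReadLE32(const unsigned char* p)
{
    return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
           ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
}

// Walks the chunk list of an in-memory WAV file and returns the
// location of the "data" chunk payload. pData is NULL on failure.
RawAudio FindWavData(const unsigned char* pFile, std::size_t nFileLen)
{
    RawAudio result = { NULL, 0 };
    if (pFile == NULL || nFileLen < 12 ||
        std::memcmp(pFile, "RIFF", 4) != 0 ||
        std::memcmp(pFile + 8, "WAVE", 4) != 0)
        return result;

    std::size_t pos = 12;  // invariant: pos <= nFileLen
    while (nFileLen - pos >= 8)
    {
        std::size_t nChunk = ReadLE32(pFile + pos + 4);
        std::size_t nAvail = nFileLen - pos - 8;
        if (std::memcmp(pFile + pos, "data", 4) == 0)
        {
            result.pData = pFile + pos + 8;
            // Clamp in case the header claims more than the buffer holds.
            result.nLength = nChunk < nAvail ? nChunk : nAvail;
            return result;
        }
        std::size_t nSkip = nChunk + (nChunk & 1);  // chunks are padded to even size
        if (nSkip > nAvail) break;
        pos += 8 + nSkip;
    }
    return result;
}
```

The returned pointer and length correspond to the M and Length parameters of LV_SRE_LoadVoiceChannel. Since M is declared as a non-const void pointer, an application would need to cast the pointer or copy the samples into its own buffer first.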
Putting it together
Once the Grammar and the VoiceChannel have been filled out, the Decode can take place. LV_SRE_Decode sends the grammar and the audio data from the voice channel into the recognizer. It may send them more than once if the gender has not been specified. When the recognition is complete, the results are stored in the VoiceChannel.
LV_SRE_API int LV_SRE_Decode(HPORT hport,int VoiceChannel,
int GrammarSet,
unsigned int flags);
Return Values
> 0 - The number of concepts matched in the grammar.
< 0 - An error code.
= 0 - Success (LV_SUCCESS). This value is returned when there are no errors and LV_DECODE_BLOCK is NOT used.
Parameters
hport - Handle to the SpeechPort.
VoiceChannel - The voice channel to process.
GrammarSet - The grammar to use to process this decode.
Flags (note: OR'ing the flags together is allowed)
LV_DECODE_USE_OOV - use out-of-vocabulary filter.
LV_DECODE_BLOCK - wait until results are available
LV_DECODE_GENDER_MALE - Gender identifier
LV_DECODE_GENDER_FEMALE – Gender identifier
LV_DECODE_FIRST_TIME_USER – resets caller weights in
Recognition Engine.
void PerformDecode(HPORT hPort, void* AudioData, int Size,
                   SOUND_FORMAT Format, int GrammarSet)
{
    //load the audio into voice channel 1, then decode that channel
    LV_SRE_LoadVoiceChannel(hPort, 1, AudioData, Size, Format);
    LV_SRE_Decode(hPort, 1, GrammarSet,
                  LV_DECODE_USE_OOV | LV_DECODE_BLOCK);
}
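The example above discards the value returned by LV_SRE_Decode. The return value rules can be captured in a small helper, sketched below. DecodeOutcome and ClassifyDecodeResult are hypothetical names, not part of the LumenVox API. Treating a blocking result of 0 as "no match" is an assumption; the return value list does not state it explicitly.

```cpp
#include <cassert>

enum DecodeOutcome
{
    DECODE_ERROR,     // negative return value: an error code
    DECODE_PENDING,   // non-blocking call accepted; results not fetched yet
    DECODE_NO_MATCH,  // blocking call finished with zero concepts (assumed)
    DECODE_MATCHED    // blocking call finished with one or more concepts
};

// nResult is the value returned by the decode call; bBlocking says
// whether LV_DECODE_BLOCK was included in the flags for that call.
DecodeOutcome ClassifyDecodeResult(int nResult, bool bBlocking)
{
    if (nResult < 0) return DECODE_ERROR;
    if (!bBlocking)  return DECODE_PENDING;
    return nResult > 0 ? DECODE_MATCHED : DECODE_NO_MATCH;
}
```

Keeping the blocking flag next to the result makes it harder to mistake a non-blocking 0, which only means the request was accepted, for a finished decode.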
Checking the Results
After the Decode has run, it is time to check the results. The call to LV_SRE_GetNumberOfConceptsReturned gives the number of concepts that were found in the audio data. If no concepts are returned, that is considered a ‘no match’. When a no match occurs, the audio data often needs to be collected from the speaker again. The results can be examined at different resolutions, from the concept down to the phoneme block selected.
int LV_SRE_GetNumberOfConceptsReturned(HPORT hport,
int VoiceChannel);
Return Values
Returns the number of concepts found for this VoiceChannel.
Parameters
hport - Handle to the SpeechPort.
VoiceChannel - This is the voice channel that was processed by Decode.
Once the number of concepts is obtained, you can move through the concept list, pulling out values for each entry. Each concept identified can be evaluated in many ways. In addition to the concept found, the following can also be retrieved: confidence score, phrase, raw text and phonemes.
In most cases the concept is all the application needs to know. Does it matter if the speaker said ‘yes please’ instead of just ‘yes’? In most cases it does not; it is a minor detail. After the decode takes place, the application asks for the number of concepts and then moves through them one at a time, performing whatever evaluation is appropriate.
const char* LV_SRE_GetConcept(HPORT hport,int VoiceChannel,
int Index);
Return Values
This function returns a null-terminated string containing the matched concept. A NULL value indicates that Index was outside the possible range.
Parameters
hport - Handle to the SpeechPort.
VoiceChannel - This is the voice channel that was processed by Decode.
Index - The recognition position of the concept to retrieve. Valid values run from 0 to one less than the value returned by LV_SRE_Decode (or by LV_SRE_GetNumberOfConceptsReturned).
Example snippet:
void SPExampleYesNoRecognition(void* AudioData, int Size,
                               SOUND_FORMAT Format)
{
    HPORT hPort;
    CMyData* pMD = new CMyData;
    const char* pcszConcept;
    hPort = LV_SRE_OpenPort(SPLogging, pMD, 3);
    if (hPort == NULL) //open failed, nothing more can be done
    {
        delete pMD;
        return;
    }
    LV_SRE_AddPhrase(hPort, 1, "no", "no");
    LV_SRE_AddPhrase(hPort, 1, "no", "nope");
    LV_SRE_AddPhrase(hPort, 1, "yes", "(yes [please]) | correct");
    LV_SRE_LoadVoiceChannel(hPort, 1, AudioData, Size, Format);
    LV_SRE_Decode(hPort, 1, 1,
                  LV_DECODE_USE_OOV | LV_DECODE_BLOCK);
    int nCount = LV_SRE_GetNumberOfConceptsReturned(hPort, 1);
    for (int i = 0; i < nCount; i++)
    {
        pcszConcept = LV_SRE_GetConcept(hPort, 1, i);
        if (!stricmp(pcszConcept, "yes"))
        {
            //they gave an affirmative answer
        }
        else
        {
            //they gave a negative response
        }
    }
    LV_SRE_ClosePort(hPort);
    delete pMD; //the logging callback is no longer needed
}
The confidence score is an evaluation of how confident the recognition engine is in its choice. If the confidence is low, the application may want to confirm the selection with the speaker.
int LV_SRE_GetConceptScore(HPORT hport,int VoiceChannel,
int Index);
Return Values
Returns the confidence score of the matched concept. The range of possible
values is 0 to 1000.
Parameters
hport - Handle to the SpeechPort.
VoiceChannel - This is the voice channel that was processed by Decode.
Index - The recognition position of the concept to retrieve. Valid values run from 0 to one less than the value returned by LV_SRE_Decode (or by LV_SRE_GetNumberOfConceptsReturned).
const char* pcszConcept;
int nScore;
int nCount = LV_SRE_GetNumberOfConceptsReturned(hport, 1);
for (int i = 0; i < nCount; i++)
{
    pcszConcept = LV_SRE_GetConcept(hport, 1, i);
    nScore = LV_SRE_GetConceptScore(hport, 1, i);
    if (nScore < 300) //example threshold on the 0 to 1000 scale
    {
        //Go confirm
    }
    else
    {
        //good concept, evaluate and take action.
    }
}
Beyond the concept and score, you may want to know which phrases are being selected. If you have six different phrases for a concept but two of them are never spoken, those should be removed because they are simply dead weight. The time it takes to decode is directly proportional to the size of the audio data and the number of phrases, and the number of phonetic block variations for each phrase adds to it further. In short, try to keep your grammars as small as you can. This will result in better accuracy and faster performance. The following functions help you evaluate what is actually being identified.
Returns a static string of the decoded phrase
const char* LV_SRE_GetPhraseDecoded(HPORT hport,
int VoiceChannel, int Index);
Returns a static string of the decoded raw text
const char* LV_SRE_GetRawTextDecoded(HPORT hport,
int VoiceChannel, int Index);
Returns a static string of the phonemes.
const char* LV_SRE_GetPhonemes(HPORT hport,
int VoiceChannel, int Index);
Return Values
A static string.
Parameters
hport - Handle to the SpeechPort.
VoiceChannel - The voice channel to process.
Index - The recognition position of the item to retrieve.
Phrase – This is the specific phrase that was recognized in the Audio Data. It will exactly match the string entered in LV_SRE_AddPhrase.
Raw Text – There are two situations where Raw Text gives better resolution than the phrase:
1. The phrase is a BNF. Raw Text gives the actual variation that was identified.
2. Phonemes were entered into LV_SRE_AddPhrase as a phrase, with a label. Raw Text shows the label instead of the Phoneme block.
Phrase = [{Y AH P: Yup}| yes] ( I Did [it]) | ([That’s] correct)
audio contains = Yup I did.
Raw Text = Yup I did.
Phonemes – These are the actual Phoneme Blocks that make up the sounds identified.
Each word in a phrase can have multiple phoneme variations. This is to support different dialects.
More Bits and Pieces
When performing a Decode, the default is to return immediately. This allows the application to perform other tasks while the Speech Engine is working. LV_SRE_WaitForEngineToIdle is the function to use to synchronize with the Recognition Engine. When it is called, control is not returned to the application until the results are ready or the wait time elapses. In most cases this function is not strictly required, because any function call that retrieves results performs the same wait.
int LV_SRE_WaitForEngineToIdle(HPORT hport,
int voicechannel,int ms);
Return Values
LV_SUCCESS - The command completed successfully.
LV_TIME_OUT - The time to wait elapsed. The engine is not idle yet.
Parameters
hport - Handle to the SpeechPort.
voicechannel - Specifies which VoiceChannel to wait on. -1 indicates to wait on all the VoiceChannels for the port.
ms - The number of milliseconds to wait before returning if the SpeechPort does not become idle.
The API currently has only one property that can be set, PROP_SAVE_SOUND_FILES. This property causes two binary files to be written to the hard disk. The Request File contains the grammar, the audio data and the acoustic model to decode with. The Response File contains the results of the Decode and the post-processed audio. Together these files give a very complete picture of the decode process and can help determine offline what actually happened.
int LV_SRE_SetProperty(HPORT hport,PROPERTIES Property,
int Value);
Return Values
LV_SUCCESS - The command completed successfully.
LV_BAD_HPORT - The hport was invalid.
LV_NOT_A_VALID_PROPERTY_VALUE - Value is invalid for the given
property.
Parameters
hport - The port's handle.
Property - Indicates which property to modify.
Value - Varies depending on Property.
The API has a convenience function that returns a descriptive string for a return code, much as the Windows FormatMessage function does for system error codes.
const char* LV_SRE_ReturnErrorString(int ReturnCode);
Return Values
A static string
Parameters
ReturnCode - A return code from another function.
Some error messages are not directly related to any one SpeechPort. To receive these messages, use the following function:
void LV_SRE_RegisterAppLogMsg(ExportLogMsg log,void *p,
int verbosity);
Return Values : none
Parameters
log - Pointer to a function that will receive logging information.
p - A void pointer to application-defined data. This data will be passed into the ExportLogMsg function to identify the application.
verbosity - range: 0 – 6
0 - minimal logging info
6 - maximum logging info