For a long time I've wanted to experiment with the .NET speech recognition and synthesis capabilities. I've seen the new Windows Vista speech recognition features and they look pretty robust, despite some negative experiences reported by our guys at Microsoft Redmond... This article provides a simple tutorial on speech recognition and synthesis with the .NET Framework.
There are a few projects I'm interested in doing for my new apartment, such as controlling electrical devices and my Media Center, and solid speech recognition would be a very interesting and useful feature for them.
Windows XP and Windows Vista both have speech recognition engines built in. The main difference between them is the version: Windows XP has SAPI version 5.1, while Windows Vista has version 5.3. SAPI stands for Speech API, and it's a full set of COM objects for speech purposes. You can program directly against it if you want, and it will surely let you go deeper in some areas, but for general purposes managed code will do perfectly.
And it is in managed code that we find the following two namespaces: System.Speech.Synthesis and System.Speech.Recognition. As the names imply, the first holds all the objects related to speech synthesis, in other words, what lets you generate an artificial voice that reads your text. The second, System.Speech.Recognition, lets you define grammars and vocabularies so that a speech recognition engine can detect vocal patterns and recognize words and phrases. Both ship in the System.Speech assembly, which the project needs to reference.
using System.Collections.Generic;
using System.IO;
using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;
using System.Speech.Synthesis;
using System.Xml;
To start experimenting, I created a new Windows Presentation Foundation project and included these namespaces. Here's my objective: to make an app that can recognize my commands and respond vocally to them. As soon as my application starts, I want it to wait for my commands, recognize them, and then take the appropriate action.
I must first specify my grammar. There are two ways to do this: using the built-in .NET object model, or creating an XML document that obeys the W3C Speech Recognition Grammar Specification (SRGS) 1.0. Although I ended up choosing the second approach (I always like to learn things from the ground up), I also tested the object model.
Basically, an SRGS XML document specifies a grammar tree with all the possible words/commands and phrase sequences, provides runtime information to the application when there is a recognition, and specifies the language and the semantics type. Here's an example:
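For comparison, the object-model route might look roughly like the sketch below. It is not the grammar used in this article: the phrases, the ids 1 and 2, the class name CommandGrammarSketch and the grammar name "commands" are all assumptions made for illustration. It only relies on the Choices, GrammarBuilder, SemanticResultValue and SemanticResultKey classes from System.Speech.Recognition.

```csharp
using System.Speech.Recognition;

public static class CommandGrammarSketch
{
    // Builds "<action> <object>" commands such as "say the time".
    // The phrases and the ids 1 and 2 are placeholders, not the article's grammar.
    public static Grammar Build()
    {
        // First part of the phrase: the verb. It carries no semantic value.
        Choices actions = new Choices("say", "tell me");

        // Second part: the object, each alternative tagged with a command id.
        Choices objects = new Choices(
            new GrammarBuilder(new SemanticResultValue("the time", 1)),
            new GrammarBuilder(new SemanticResultValue("the date", 2)));

        GrammarBuilder command = new GrammarBuilder();
        command.Append(actions);
        // Store whichever id was matched under the "commandId" key.
        command.Append(new SemanticResultKey("commandId", new GrammarBuilder(objects)));

        Grammar grammar = new Grammar(command);
        grammar.Name = "commands"; // hypothetical name, handy when several grammars are loaded
        return grammar;
    }
}
```

A grammar built this way is loaded into a recognizer with LoadGrammar, exactly like one created from an SRGS document.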
<?xml version="1.0" ?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN" "http://www.w3.org/TR/speech-grammar/grammar.dtd">
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd"
         tag-format="semantics-ms/1.0" version="1.0" mode="voice" root="commandRoot">
  <!-- For more info on SML and tags, check http://msdn.microsoft.com/en-us/library/ms870098.aspx -->
  <meta name="author" content="Vasco Oliveira"/>
  <rule id="commandRoot" scope="public">
    <tag>$.commandId={}; </tag>
    <one-of>
      <item>
        <ruleref uri="#command" />
      </item>
      <item>
        <ruleref uri="#greet" />
      </item>
    </one-of>
    <tag>$.commandId = $$.commandId; </tag>
  </rule>
  <rule id="greet">
    <item repeat="0-1">Hello computer</item>
    <tag>$.commandId = 0; </tag>
  </rule>
  <rule id="command">
    <ruleref uri="#action"/>
    <ruleref uri="#object"/>
    <tag>$.commandId = $$.commandId; </tag>
  </rule>
This block of XML starts by declaring a grammar in English (xml:lang="en"), sets tag-format to "semantics-ms/1.0" so that information can be passed to the application at runtime, sets mode to "voice" (this could also be DTMF), and names the root rule where the speech parser starts.
The base rule is "commandRoot", and it branches into one of two rules (the "command" rule and the "greet" rule) through the <one-of> node. This means that, when a word or phrase is spoken, the engine follows one of these two paths until a full rule is recognized. To reference a rule we use <ruleref>, while rules themselves are defined with <rule> nodes, like the "greet" rule. The "greet" rule in this sample consists of the phrase "Hello computer", and its repeat attribute ("0-1") says the phrase may occur zero or one times, which makes it optional. The <tag> node, in turn, holds a script block that manipulates the SML information the application can capture. My example uses this, so I'll explain how later in this article. By the way, SML stands for Semantic Markup Language.
When developing speech recognition applications for Windows Vista or XP, there are two ways to set up your application with regard to the speech recognition engine. You can use your own in-process (InProc) recognizer, which is exclusive to your application, or you can use the shared recognizer, the one used by the OS, which is available to any application on a Windows desktop system. In Windows Vista, for example, the shared mode opens the Windows speech recognition UI, and your application shares its grammar/dictionary with the operating system. If you go for the first option, use the SpeechRecognitionEngine object; the shared approach uses SpeechRecognizer.
This is how the code goes... On Load, first load the grammar file and set the event handler that runs whenever the grammar produces a recognition:
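The listing above stops after the "command" rule, so the "action" and "object" rules it references, and the closing </grammar> tag, are not shown. A purely hypothetical completion could look like the fragment below. The phrases and the ids 1 and 2 are my assumptions for illustration, not the original grammar; the only point is to show how the last referenced rule ("object") can set the commandId that the "command" rule then copies through $$.

```xml
  <!-- Hypothetical rules: the phrases and command ids are placeholders. -->
  <rule id="action">
    <one-of>
      <item>say</item>
      <item>tell me</item>
    </one-of>
  </rule>
  <rule id="object">
    <one-of>
      <item>the time <tag>$.commandId = 1; </tag></item>
      <item>the date <tag>$.commandId = 2; </tag></item>
    </one-of>
  </rule>
</grammar>
```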
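// Load the SRGS grammar that is embedded in the project resources.
byte[] grammarBinary = Properties.Resources.Grammar;
MemoryStream stream = new MemoryStream(grammarBinary);
stream.Position = 0;

// The grammar file has a DOCTYPE declaration, so DTD processing must be allowed.
// (ProhibitDtd is obsolete from .NET 4 on; there, use settings.DtdProcessing = DtdProcessing.Parse.)
XmlReaderSettings settings = new XmlReaderSettings();
settings.ProhibitDtd = false;
XmlReader xr = XmlReader.Create(stream, settings);

// Build the grammar starting at the "commandRoot" rule and subscribe to its recognitions.
GrammarDocument = new SrgsDocument(xr);
ActiveGrammar = new Grammar(GrammarDocument, "commandRoot");
ActiveGrammar.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(SpeechRecognized);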
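Next, set up the SpeechRecognitionEngine object:
// Create an in-process recognizer that listens on the default microphone.
Recognizer = new SpeechRecognitionEngine();
Recognizer.SetInputToDefaultAudioDevice();
// Load only our grammar, then keep recognizing until explicitly stopped.
Recognizer.UnloadAllGrammars();
Recognizer.LoadGrammar(ActiveGrammar);
Recognizer.RecognizeAsync(RecognizeMode.Multiple);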
And the Synthesizer object to make the computer speak:
// Set speech synthesizer and save all installed voices on the OS
// Windows Vista comes with Microsoft Anna installed.
Synthesizer = new SpeechSynthesizer();
Voices = Synthesizer.GetInstalledVoices();
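Once the installed voices are known, one of them can be selected by name. The helper below is a small sketch of that idea; the VoicePicker class, its method names and the choice of matching on culture are my assumptions, while InstalledVoice.Enabled, VoiceInfo.Culture, VoiceInfo.Name and SpeechSynthesizer.SelectVoice come from System.Speech.Synthesis.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Speech.Synthesis;

public static class VoicePicker
{
    // Returns the name of the first enabled voice for the given culture,
    // or null when no such voice is installed.
    public static string FindVoiceName(IEnumerable<InstalledVoice> voices, CultureInfo culture)
    {
        foreach (InstalledVoice voice in voices)
        {
            if (voice.Enabled && voice.VoiceInfo.Culture.Equals(culture))
                return voice.VoiceInfo.Name;
        }
        return null;
    }

    // Selects a matching voice on the synthesizer, if there is one.
    // Returns false and leaves the current voice untouched otherwise.
    public static bool TrySelect(SpeechSynthesizer synthesizer, CultureInfo culture)
    {
        string name = FindVoiceName(synthesizer.GetInstalledVoices(), culture);
        if (name == null)
            return false;
        synthesizer.SelectVoice(name);
        return true;
    }
}
```

With the objects from this article, the call would be something like VoicePicker.TrySelect(Synthesizer, new CultureInfo("en-US")).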
Finally, we must write the event handler that deals with the recognitions. In this case, I'm making the computer say the time and date, and also greet:
private void SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    if (e.Result.Semantics == null)
        return;

    // Each semantic key set by the grammar's <tag> scripts carries a command id.
    foreach (KeyValuePair<String, SemanticValue> child in e.Result.Semantics)
    {
        switch (int.Parse(child.Value.Value.ToString()))
        {
            case 0: // greeting
                Synthesizer.SpeakAsync(Properties.Resources.GreetOwner);
                break;
            case 1: // current time
                Synthesizer.SpeakAsync("It's " + DateTime.Now.Hour + " hours, and " + DateTime.Now.Minute + " minutes.");
                break;
            case 2:
                Synthesizer.SpeakAsync("Okay.");
                break;
            case 3:
                // Pass the recognized text rather than the result object itself.
                Synthesizer.SpeakAsync(string.Format(Properties.Resources.GreetOther, e.Result.Text));
                break;
            default:
                break;
        }
    }
}
With this much code, your application recognizes commands and responds verbally to them. Try saying "Computer, say the date" and see what happens.
We could also specify XML files for the computer to read. This brings along another XML specification called Speech Synthesis Markup Language (SSML), which looks something like this:
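The handler above loops over every semantic key and speaks from inside the switch. A possible variation, sketched below, reads a single "commandId" key, ignores low-confidence results, and keeps the id-to-sentence mapping in a pure function that is easy to check on its own. The CommandReplies class, the 0.6 threshold and the "Hello." greeting are assumptions of mine, not part of the original application.

```csharp
using System;
using System.Speech.Recognition;

public static class CommandReplies
{
    // Hypothetical threshold; tune it for your microphone and room.
    public const float MinConfidence = 0.6f;

    // Pure mapping from a command id to the sentence to speak.
    // Returns null for ids this sketch does not know about.
    public static string ReplyFor(int commandId, DateTime now)
    {
        switch (commandId)
        {
            case 0: return "Hello."; // placeholder greeting
            case 1: return "It's " + now.Hour + " hours, and " + now.Minute + " minutes.";
            default: return null;
        }
    }

    // Extracts the "commandId" semantic value from a recognition result.
    // Returns null when the result is missing, uncertain, or has no command id.
    public static string ReplyForResult(RecognitionResult result, DateTime now)
    {
        if (result == null || result.Confidence < MinConfidence)
            return null;

        SemanticValue semantics = result.Semantics;
        if (semantics == null || !semantics.ContainsKey("commandId"))
            return null;

        object raw = semantics["commandId"].Value;
        if (raw == null)
            return null;

        return ReplyFor(Convert.ToInt32(raw), now);
    }
}
```

Inside the event handler this would boil down to calling CommandReplies.ReplyForResult(e.Result, DateTime.Now) and passing a non-null reply to Synthesizer.SpeakAsync.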
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN" "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US">
  <p>
    <s>Hello <prosody rate="-40%">Vasco</prosody>. You have 2 new messages. Do you wish me to read them? <break time="3s"/> Okay...</s>
    <s>The first is from <prosody rate="-40%">Rachel Blacksmith</prosody>, and arrived at 3:45 pm.</s>
    <s><prosody rate="-40%">The subject is ".NET"</prosody></s>
  </p>
</speak>
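A file like this controls things such as pauses and speaking rate for the sentences to be said (SSML can also control pitch and volume). Check the W3C SSML specification to understand what the node types mean. To use this file, we just need to prep our Synthesizer object like this:
// Send the speech to the default audio device and queue the SSML file.
Synthesizer.SetOutputToDefaultAudioDevice();
PromptBuilder pb = new PromptBuilder();
pb.AppendSsml("SSML\\Welcome.xml"); // path to the SSML file shown above
Synthesizer.SpeakAsync(pb);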
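If you'd rather not keep a separate SSML file, a similar prompt can be assembled in code. The sketch below covers only the first sentence of the file above and approximates the rate="-40%" prosody with PromptRate.Slow; the WelcomePrompt class and its parameters are hypothetical names of mine. AppendText, AppendBreak and ToXml are PromptBuilder members.

```csharp
using System;
using System.Speech.Synthesis;

public static class WelcomePrompt
{
    // Builds a prompt similar to the first sentence of the SSML file.
    public static PromptBuilder Build(string userName, int messageCount)
    {
        PromptBuilder pb = new PromptBuilder();
        pb.AppendText("Hello ");
        pb.AppendText(userName, PromptRate.Slow);   // slower speech for the name
        pb.AppendText(". You have " + messageCount + " new messages. Do you wish me to read them?");
        pb.AppendBreak(TimeSpan.FromSeconds(3));    // counterpart of <break time="3s"/>
        pb.AppendText("Okay...");
        return pb;
    }
}
```

The result can be passed straight to Synthesizer.SpeakAsync, and calling ToXml() on it shows the SSML the builder generates, which is a handy way to learn the markup.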
Enjoy these two excellent namespaces!