Listening, and not just hearing, with the Kinect


Speaking of speech and the Kinect, Leland Holmquest is back with another MSDN Magazine article talking about it...

Listening with Kinect

In previous articles, I walked you through the capabilities you can tap into when developing Kinect for Windows applications. You have seen how to use the skeleton tracker, how to create test scenarios and how to enable 3D sight with the depth sensor. In this article, I explain how to use the Kinect hardware for speech recognition. Those of you who are already familiar with the System.Speech.Recognition library will notice many similarities between it and Microsoft.Speech, which Kinect uses, and should feel right at home. Let’s begin by looking at those two libraries.

A Tale of Two Libraries

Speech recognition isn’t new. For some time, Windows developers have been connecting microphones and other hardware to their Windows machines to listen and respond to human speech. Using the System.Speech library, applications can build grammars, direct where to “listen” from, respond with semantic correctness and even take dictation with increasing accuracy and ease. Now enters the Kinect for Windows SDK with the Microsoft.Speech library—but why do we need a new Speech library?

On a closer look, the System.Speech and Microsoft.Speech libraries appear comparable. In fact, they are a lot alike: they have the same classes, the same methods and so on. The main difference between these two libraries is that Microsoft.Speech is specifically optimized for the Kinect hardware. Although it is possible to use System.Speech with the Kinect hardware or, conversely, Microsoft.Speech with a regular microphone, in both cases the results are less than optimal. (We’ll actually try this later in the article and see how it works.)

The other major distinction between the two libraries is that as of Kinect for Windows SDK 1.6, Microsoft.Speech doesn’t support the dictation model. The recognition engine doesn’t support DictationGrammar. That means that to use the speech recognition aspect of the SDK, you must know all the spoken values and incorporate them into a fixed grammar. Although this constraint is serious, Microsoft.Speech still gives you an enormous amount of freedom to interact with users via speech. Later in this article, I’ll return to the concept of dictation with a little experiment, but first, let’s go over the basics of speech recognition for the Kinect.
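To make the fixed-grammar constraint concrete, here is a minimal, language-neutral sketch in Python. It is not the Microsoft.Speech API: the names `build_phrases`, `recognize` and `GRAMMAR` are hypothetical and only illustrate the idea that, without a dictation model, every acceptable phrase has to be enumerated up front and anything else is rejected.

```python
from itertools import product


def build_phrases(*slots):
    """Expand slots of alternatives into the full set of accepted phrases.

    Each slot is a list of alternative words; a phrase picks one word per
    slot, in order. This mimics the idea of a fixed grammar, not a real API.
    """
    return {" ".join(words) for words in product(*slots)}


def recognize(utterance, phrases):
    """Return the normalized phrase if the grammar contains it, else None."""
    normalized = " ".join(utterance.lower().split())
    return normalized if normalized in phrases else None


# Hypothetical grammar: a command verb followed by a color.
GRAMMAR = build_phrases(["show", "hide"], ["red", "green", "blue"])
```

With this toy grammar, `recognize("Show  RED", GRAMMAR)` returns `"show red"`, while `recognize("show purple", GRAMMAR)` returns `None` because "purple" was never listed. A dictation model would accept the second phrase; a fixed grammar cannot.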


What Do You Have to Say?

In this article, I demonstrated some of the techniques and tools available when using the Kinect for Windows sensor to facilitate speech recognition. I showed how you can define a grammar within code as well as in an XML document. I also attempted to use Kinect to receive dictation. By using both Microsoft.Speech and System.Speech in conjunction with the Kinect for Windows SDK (1.6), I was able to get some decent results with limited effort.

Using these techniques opens the door for some really creative user experiences. Couple the speech-recognition capabilities with the depth and skeleton-tracking capabilities, and you can provide Kinect users with even more sophisticated experiences. For example, you could have the speech recognizer respond only if the user is looking directly at the Kinect, more closely simulating humanistic eye contact.
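As a rough sketch of that last idea, the gate could look like the following Python. Everything here is an assumption for illustration: the function name, the head yaw and pitch inputs (in degrees, with 0 meaning the user faces the sensor), the confidence scale and the thresholds are hypothetical and do not come from the Kinect for Windows SDK.

```python
def should_respond(confidence, head_yaw_deg, head_pitch_deg,
                   min_confidence=0.7, max_angle_deg=15.0):
    """Accept a recognized phrase only when the user appears to face the sensor.

    confidence: recognizer confidence, assumed to lie in [0, 1].
    head_yaw_deg, head_pitch_deg: hypothetical head-pose angles in degrees,
        where 0 means the user is facing the sensor.
    """
    facing = (abs(head_yaw_deg) <= max_angle_deg
              and abs(head_pitch_deg) <= max_angle_deg)
    return facing and confidence >= min_confidence
```

A speech handler would call this before acting on a result, so a command spoken while the user is turned away is ignored even when the recognizer is confident.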



The Discussion

  • james braselton

    hi there i played kinect at best buy demo unit i want kinect

  • Jean Philippe Encausse

    Very good article!

    I will update my code in v2.5 of SARAH to use Microsoft.Speech instead of System.Speech.

    The result is really better, even if I lose dictation :-(

    In fact, the worst thing is losing the getAudioForWord() method. I was using it to perform speech-to-text with the Google API on <ruleref="garbage">.

    I confirm that Gesture + Voice + QRCode is a really good experience :-)
