Using the Kinect for Windows SDK v1.8 to add speech recognition to your next web app...


Today's post by Eddy Escardo-Raffo sheds more light on one of the new features in the Kinect for Windows SDK v1.8, and shows it off in a way that you might not expect...

Using Kinect Webserver to Expose Speech Events to Web Clients

In our 1.8 release, we made it easy to create Kinect-enabled HTML5 web applications. This is possible because we added an extensible webserver for Kinect data, along with a JavaScript API that gives developers some great functionality right out of the box:

  • Interactions: hand pointer movements, press and grip events useful for controlling a cursor, buttons and other UI
  • User Viewer: visual representation of the users currently visible to the Kinect sensor. Uses different colors to indicate different user states
  • Background Removal: “Green screen” image stream for a single person at a time
  • Skeleton: standard skeleton data such as tracking state, joint positions, joint orientations, etc.
  • Sensor Status: Events corresponding to sensor connection/disconnection

This is enough functionality to write a compelling application, but it doesn’t represent the whole range of Kinect sensor capabilities. In this article I will show you, step by step, how to extend the WebserverBasics-WPF sample (see C# code in CodePlex or documentation in MSDN), available from the Kinect Toolkit Browser, so that web applications can respond to speech commands, with the active speech grammar configurable by the web client.

A solution containing the full, final sample code is available on CodePlex. To compile this sample you will also need Microsoft.Samples.Kinect.Webserver (available via CodePlex and Toolkit Browser) and Microsoft.Kinect.Toolkit components (available via Toolkit Browser).

More specifically, on the server side we will:

  1. Create a speech recognition engine
  2. Bind the engine to a Kinect sensor’s audio stream whenever a sensor gets connected/disconnected
  3. Allow a web client to specify the speech grammar to be recognized
  4. Forward speech recognition events generated by the engine to the web client
  5. Register a factory for the speech stream handler with the Kinect webserver

This will be accomplished by creating a class called SpeechStreamHandler, derived from Microsoft.Samples.Kinect.Webserver.Sensor.SensorStreamHandlerBase. SensorStreamHandlerBase is an implementation of ISensorStreamHandler that frees us from writing boilerplate code. ISensorStreamHandler is an abstraction that gets notified whenever a Kinect sensor gets connected/disconnected, when color, depth and skeleton frames become available and when web clients request to view or update configuration values. In response, our speech stream handler will send event messages to web clients.

On the web client side we will:

  1. Configure the speech recognition stream (enable it and specify the speech grammar to be recognized)
  2. Modify the web UI in response to recognized speech events

All new client-side code is in SamplePage.html.
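To make the first client-side step more concrete, here is a minimal sketch of how a page might assemble a speech grammar and wrap it in a configuration object. It is not the sample's actual code: the helper names (escapeXml, buildGrammarXml, buildSpeechConfig) and the property names "speech", "enabled" and "grammarXml" are assumptions made for illustration. The grammar itself follows the W3C SRGS XML format, and the phrases are the ones the finished sample responds to.

```javascript
// Hypothetical helpers; none of these names come from the Kinect JavaScript API.

// Escape characters that are special in XML text content.
function escapeXml(text) {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Build a minimal SRGS grammar whose root rule matches any one of the phrases.
function buildGrammarXml(phrases) {
    var items = phrases.map(function (phrase) {
        return "      <item>" + escapeXml(phrase) + "</item>";
    });
    return [
        '<grammar version="1.0" xml:lang="en-US" root="command"',
        '         xmlns="http://www.w3.org/2001/06/grammar">',
        '  <rule id="command" scope="public">',
        '    <one-of>'
    ].concat(items, [
        '    </one-of>',
        '  </rule>',
        '</grammar>'
    ]).join("\n");
}

// Wrap the grammar in a configuration object for the speech stream.
// The property names "speech", "enabled" and "grammarXml" are assumptions.
function buildSpeechConfig(phrases) {
    return {
        speech: {
            enabled: true,
            grammarXml: buildGrammarXml(phrases)
        }
    };
}

var speechConfig = buildSpeechConfig(["Show", "Show Panel", "Hide", "Hide Panel"]);
```

An object like speechConfig would then be handed to whatever call the page uses to send configuration to the server. Keeping the phrase list separate from the XML template means the grammar can be changed by editing a single array.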


Party Time!

At this point you can rebuild the updated solution and run it to see the server UI. From this UI you can click on the link that reads “Open sample page in default browser” and play with the sample UI. It will look the same as before the code changes, but will respond to the speech phrases “Show”, “Show Panel”, “Hide” and “Hide Panel”. Now try changing the grammar to include more phrases and update the UI in different ways in response to speech events.
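If you want to experiment with more phrases, one possible way to organize the client side is a small lookup table from recognized phrase to UI action. The sketch below is illustrative only: phraseActions, actionForPhrase and applyAction are hypothetical names, and it assumes that a recognition event gives you the recognized text and a confidence value between 0 and 1, which the sample may expose differently.

```javascript
// Hypothetical phrase-to-action table; extend it as you add phrases to the grammar.
var phraseActions = {
    "show": "show",
    "show panel": "show",
    "hide": "hide",
    "hide panel": "hide"
};

// Decide what to do with a recognized phrase.
// Returns "show", "hide", or null when the phrase is unknown or the
// recognizer's confidence is below the threshold.
function actionForPhrase(text, confidence, minConfidence) {
    if (confidence < minConfidence) {
        return null;
    }
    var key = text.trim().toLowerCase();
    return Object.prototype.hasOwnProperty.call(phraseActions, key)
        ? phraseActions[key]
        : null;
}

// Apply the action to a panel-like object. In a page this could be a DOM
// element, whose style.display property controls visibility.
function applyAction(action, panel) {
    if (action === "show") {
        panel.style.display = "block";
    } else if (action === "hide") {
        panel.style.display = "none";
    }
}

// Usage with a stand-in object instead of a real DOM element.
var fakePanel = { style: { display: "none" } };
applyAction(actionForPhrase("Show Panel", 0.9, 0.5), fakePanel);
```

Adding a new command then takes two changes: one more phrase in the grammar and one more entry in the table. The confidence threshold is a design choice that keeps background noise from toggling the UI.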

