Sitting down with the Kinect and the Media Center
- Posted: Mar 16, 2012 at 6:00 AM
There are a number of cool things in this free binary download. While there's no source to peek at, Steve and the team at amuletdevices.com have written a couple of great blog posts that discuss their project in a fair amount of detail and depth, and that explain how they solved a commonly asked problem: using the Kinect while sitting down (or even lying down on the couch...).
For anyone struggling with using the Kinect while sitting, I've made some progress with identifying postures while sitting and lying down.
The skeletal recogniser often loses tracking in that situation, so instead I'm using HaarCascades from OpenCV with the Kinect depth camera. I have a blog entry with some background here.
The bottom line is, it works!
There are some videos on the blog so you can see the results.
Project Download URL: http://www.amuletdevices.com/index.php/Features/kinect.html
You may have previously seen early versions of the Amulet software being used with Kinect in videos. Now the music functionality is finished and we’re ready to give it away. If it proves popular then we’ll release a paid version that will include all the other functionality currently in our “Amulet Voice Remote” product and more…
I thought I’d make a posting to the Amulet blog to explain a bit of the background behind how and why “Amulet Voice Kinect” works the way it does.
The design of Amulet Voice Kinect addresses two issues that can have a negative effect on the usability of speech recognition solutions.
- The first issue particularly affects speech control of media: the media you’re playing is loud enough that it swamps the user’s voice, so you lose control over the system.
- The second issue is where noise or unintended speech produces a “false positive” and so the system does something that you didn’t tell it to do.
Our existing product, the “Amulet Voice Remote”, solves these two problems by ensuring that the user’s mouth is close to a microphone: to use it, you tilt the remote and raise it to your lips. This ensures that the system only listens when the user is talking (the remote has to be tilted) and that the user’s voice can’t be swamped by environmental noise or loud music, as the mic is very close to the user’s mouth any time it’s being used.
Amulet Voice Kinect doesn’t have the benefit of a close-proximity mic or an on/off switch. The Microsoft Kinect that it’s used with does have some very impressive echo cancellation technology built in, and if you keep your media playing at moderate volume levels it works well. In practice, though, it’s all too easy to get into a situation where a particular song is too loud, the system can’t hear you, and you can’t recover without resorting to a keyboard or remote control.
So we have addressed that issue in AVK by dipping the music volume when the user is talking. How does the system know the user is talking? He or she signals it by raising a hand (the left hand in this case). This also addresses the second issue, because if the user hasn’t signalled that a command is coming then the system isn’t listening, and there’s zero chance of picking up any false positives.
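As a rough illustration of that gating idea (not Amulet's actual code), the sketch below ties both behaviours to a single "hand raised" flag. The class name, the volume levels and the `accept` method are all hypothetical, and the flag is assumed to come from whatever gesture detector is in use.

```python
from dataclasses import dataclass


@dataclass
class ListenGate:
    """Hypothetical gate: listen, and dip the music, only while the hand is raised."""

    normal_volume: float = 1.0  # assumed playback level on a 0.0-1.0 scale
    ducked_volume: float = 0.2  # assumed level while the user is talking
    listening: bool = False

    def update(self, hand_raised: bool) -> float:
        """Record the gesture state and return the music volume to apply."""
        self.listening = hand_raised
        return self.ducked_volume if hand_raised else self.normal_volume

    def accept(self, phrase: str):
        """Pass a recognised phrase through only while the gate is open."""
        return phrase if self.listening else None
```

With the hand down, `accept` discards everything the recogniser hears, which is how unintended speech never reaches the command handler; with the hand up, the lower volume gives the microphone a better chance of hearing the user over the music.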
So I figured the thing to do would be to use some of those same computer vision routines with the Kinect depth camera. The depth camera works in the dark because, instead of measuring pixel light intensity, it uses IR and outputs pixel distance, so you essentially get the shape of someone even in complete darkness. I’ve seen HaarClassifiers (or “a cascade of boosted classifiers based on Haar-like features”) from OpenCV used very effectively to recognise faces, but I couldn’t find any evidence of people using them with a depth stream, so I decided to try that. You can read more details in part 2 of this blog article.
So after deciding that I wanted to see if HaarCascades would work at all with the depth stream from Kinect, I figured I needed to give the first test the best chance of success. There was no point making a half-hearted effort, having it fail and then having to have another go, because I wouldn’t have known whether the failure was down to some fundamental problem with using the depth stream with HaarCascades or just to poor HaarClassifiers. To give the test the best chance of success I would need several thousand depth images containing the object that I needed the classifier to recognise and several thousand negative images not containing the object. I settled on 2000 of each.
Before running the first test though I figured that using the raw depth data might be problematic as it contains a lot of noise (due to reflective surfaces) and IR shadows (produced as a result of the IR projector and IR sensor positions on the Kinect being offset).
Pixels that are part of this noise return an “undefined” depth value. My first effort at cleaning up the images involved feeding the images to a buffer but, in subsequent frames, only updating those pixels that were not undefined. This gives a much better looking depth picture, as noisy undefined pixels from reflective surfaces will often have a sporadic good value, and that good value persists from earlier frames because it’s preferred to the undefined value. This works very well in removing noise and some shadows, but a big drawback is that if you have a large area of undefined pixels, such as those that are further away than the max range of the Kinect, you get smearing. So if I stand in front of a wall that’s within the max range of the Kinect I get a nice depth pic with minimal noise, but if the wall behind me is out of range and I wave my hands around, they leave trails on the pixels that are undefined (as undefined pixels retain their values from the last time they were defined). I had to abandon that scheme, as I presumed the classifiers would have problems if they were fed bitmaps that contained smearing.
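A minimal sketch of that buffering scheme, and of why it smears, is below. It is an illustration rather than the original implementation: frames are modelled as flat lists of depth values, and the `UNDEFINED` marker of 0 and the function names are assumptions.

```python
UNDEFINED = 0  # assumption: the sensor reports 0 for pixels with no depth reading


def merge_frame(buffer, frame):
    """Copy only defined pixels into the buffer; undefined ones keep their old value."""
    for i, depth in enumerate(frame):
        if depth != UNDEFINED:
            buffer[i] = depth
    return buffer


def run(frames):
    """Feed a sequence of equally sized frames through a fresh buffer."""
    buffer = [UNDEFINED] * len(frames[0])
    for frame in frames:
        merge_frame(buffer, frame)
    return buffer


# A flickering pixel on a reflective surface keeps its last good reading.
flicker = run([[1200], [UNDEFINED], [UNDEFINED]])

# A hand at 900 mm moving across an out-of-range background leaves a trail,
# because the background never produces a defined value to overwrite it.
smear = run([
    [900, UNDEFINED, UNDEFINED],
    [UNDEFINED, 900, UNDEFINED],
    [UNDEFINED, UNDEFINED, 900],
])
```

In the second case every pixel the hand has passed over still holds 900 after the hand has moved on, which is the smearing described above.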
There are a number of tutorials on the internet that show how to use HaarCascades, so I won’t rehash the details here, but in general the procedure entails marking out the “timeout” pose in the positive images with Objectmarker.exe. As there are two thousand images in this test, it’s a very monotonous task. One tip for using Objectmarker: if you don’t like the default keyboard and mouse button combinations, you can override them with “X-Mouse Button Control”. The result of using Objectmarker is a text file that contains a list of filenames and coordinate pairs. The coordinates are where you identified each pose within each image. A separate text file is created that lists the filenames of each of the two thousand negative images. Then Createsamples from OpenCV is used to make a .vec file using the two previously created text files. That vector file can then be used by HaarTraining to produce the cascade, which can then be turned into an actual classifier with Haarconv.exe. That classifier (in XML format) can then be used in a software app with HaarCascade.detect to determine if the pose is in a live frame coming from Kinect.
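To make the two text files concrete, here is a small sketch of how such lists could be generated. It assumes the common OpenCV description-file layout of one line per positive image (filename, number of marked objects, then x, y, width and height for each), and one filename per line for the negatives. The helper names and the example paths are hypothetical.

```python
def info_line(filename, boxes):
    """Format one positive-image entry: name, object count, then x y w h per box."""
    parts = [filename, str(len(boxes))]
    for x, y, w, h in boxes:
        parts += [str(x), str(y), str(w), str(h)]
    return " ".join(parts)


def negatives_list(filenames):
    """Format the negative list: one image filename per line."""
    return "".join(name + "\n" for name in filenames)


# Hypothetical example: one marked pose in a positive depth image.
example_positive = info_line("pos/0001.bmp", [(140, 100, 45, 45)])
example_negatives = negatives_list(["neg/0001.bmp", "neg/0002.bmp"])
```

Writing these strings out to text files gives the two inputs described above.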
HaarTraining uses a lot of CPU cycles; in my experience it can take anything from a day to a week to produce the classifier. I set the number of desired stages to 30 and let it process. You can stop the process at any time and resume it, so after about 24 hours, when it had completed 11 stages, I tried it out. It worked! There were some false positives, so something that to the naked eye looked nothing like a person making a “T” shape would trigger it, but essentially it worked. The false positives can be removed by a number of methods: you can make the cascade longer, add hysteresis to the detection routine, and bump up the required adjacent frames in HaarCascade.detect.
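One way hysteresis could be added to a per-frame detection routine is sketched below. This is an assumed design, not the routine used in Amulet Voice Kinect: the pose only counts as present after several consecutive positive frames, and only counts as gone after several consecutive negative ones. The class name and the default frame counts are hypothetical.

```python
class PoseLatch:
    """Hypothetical hysteresis filter over per-frame detector results."""

    def __init__(self, on_frames=5, off_frames=10):
        self.on_frames = on_frames    # consecutive hits needed to switch on
        self.off_frames = off_frames  # consecutive misses needed to switch off
        self.active = False
        self._run = 0  # consecutive frames that disagree with the current state

    def update(self, detected: bool) -> bool:
        """Feed one frame's raw detection result; return the filtered state."""
        if detected == self.active:
            self._run = 0  # agreement resets the counter
        else:
            self._run += 1
            needed = self.off_frames if self.active else self.on_frames
            if self._run >= needed:
                self.active = detected
                self._run = 0
        return self.active
```

A single spurious hit never switches the state on, and a single dropped frame never switches it off, at the cost of a few frames of delay in each direction.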
After letting the training run for 5 days it produced a 24-stage cascade, which had far fewer false positives. I’ve found that the number of stages used in this particular application needs to be a balance: large enough to keep the classifier accurate, but small enough that it doesn’t use too much CPU when you use it in your app. Because HaarCascade.detect is run on each depth frame that comes from the Kinect, it needs to be able to keep up. One tip I have for anyone trying this at home is that OpenCV V1.0 uses multithreading to a far greater extent than later versions. I’m able to max out 8 CPU cores while using HaarTraining from V1; later versions are not able to do that, and it makes a big difference. So I used just the HaarTraining from V1 but used the HaarCascade.detect from the latest version. Another useful tip is to randomise the order of the positive and negative bitmaps; this seemed to give me better results at an earlier stage when training.
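Randomising the order can be as simple as shuffling the entries of each list file before training. The sketch below is one assumed way to do it; the `shuffled` helper and its optional seed (so that a run can be repeated) are hypothetical additions.

```python
import random


def shuffled(entries, seed=None):
    """Return a shuffled copy of the list entries, leaving the input untouched."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    result = list(entries)
    rng.shuffle(result)
    return result


# Hypothetical negative-list entries; positive entries work the same way.
negatives = ["neg/%04d.bmp" % i for i in range(1, 6)]
randomised = shuffled(negatives, seed=42)
```

Each positive entry carries its own coordinates on the same line, so shuffling whole lines keeps every image paired with its marked pose.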
Amulet Voice Kinect lets you use Microsoft Kinect to control your music on a Windows 7 PC. It has these features:
- Access your Windows Media Center music directly using simple voice commands
- Use gestures to turn on and off speech recognition from the comfort of your couch
Amulet Voice Kinect is a free download with no restrictions on use.
Amulet Voice Kinect also lets you:
- See visual feedback for both audio and gesture commands within Media Center
- Use voice to navigate around other sections of Media Center
- Add additional custom commands to perform other actions (advanced users only)
What you need
- A Windows 7 PC with at least 4 GB of RAM and a dual-core 2.66 GHz processor
- A Microsoft Kinect