Frank Seide - Video Search (Microsoft Research Asia)


The Discussion

  • User profile image

    It is very exciting to see speech recognition technology used in an application. The search was faster than I was expecting. Does it use a cache, like a web-based search engine would? How much memory did all of the video take up?

    You mentioned a speech recognizer. What is the process for compiling a speech recognizer?

    Can Channel 9 feature more videos like this one, about applications using speech recognition technology? I can only imagine what other things are being worked on. Answering machines? Automated appliances operating on voice commands? I could turn the TV on without having to use a remote control, and navigate to my desired channel just by speaking the channel number out loud. (Or, as seen in your video, just by telling the TV the subject matter I want to search for.)

    Does this mean I'll eventually be able to talk to my car as though I were talking to a person (granted the AI is there)? For those times I lock my keys in the car, I would say "George" (because George is the name I would give my car), "George, please unlock my door!" How about navigating my computer? I would like to talk to my computer rather than type all the time.

    How about an application for gaming? Xbox 360? It would be great to have team-based game play where you're able to give spoken commands to your AI teammates via headset and mic. Some of the menu displays for that stuff get so cluttered and in the way. I'm qualified to assist in developing a computer game that applies speech recognition. =)

    Pushing the buttons on my automatic teller machine is so out of date; the buttons are always so dirty, too. "I would like to deposit 10 dollars, please," then I would place my eye up to the scanner rather than type in my PIN. So much faster. No more having to carry a whole wallet full of cards. Do you know how uncomfortable it is to sit on a pile of magnetic cards each day at the office? Will I be able to order a pizza without someone having to stand by the phone and wait for me to place my order? This technology will be great for phones.

    My favorite will be the international language translator: a device carried by a person that translates the words someone speaks into any language the user specifies. It would do wonders for Americans vacationing in Paris. The Tablet PC would be a nice device to carry this software on. With everyone having a Tablet PC to carry around with them, all the world powers would have no excuse for not understanding one another, and they would then all get along with one another. I'm a very optimistic person.

    It is a very exciting time for this technology. Thank you for sharing the video. I hope to see more like it and have my questions answered.


  • User profile image
    Thanks for your comment.

    I've just asked Frank Seide to reply to your questions about the video-search technology, if possible. FYI, the demo system stored Channel 9 videos, so when I shot the video in Tokyo, Japan, Frank said he would search for his own video with the system.

    Anyway, I think this video-search technology is very interesting to many people. There is also a lot of research on speech recognition technology at Microsoft Research Asia.

  • User profile image

    Hi Jason, let me answer a few of your questions.

    First, yes, we use an index during the search, like a web search engine. The actual search only takes a second. The long delay in the video was because we sent the search result from the server to the TV client in a rather inefficient way (as researchers we often resort to rapid prototyping techniques which are fast to implement but not always fast in execution).

    It would be impossible to perform all the signal and speech processing for 2500 hours of audio in such a short time. That work is done in advance, when the index is built, and it takes many days on a server farm with tens of computers.

    The memory footprint of the searcher is very small. A real web video search engine would include so much video that the index would be too big to fit into memory. Thus the search function reads all index data directly from disk during searching.
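
    As an illustration only (this is not the code behind the demo), the idea of answering a query by reading just the relevant part of an on-disk index can be sketched as follows. The file layout, the function names `build_index` and `search`, and the toy transcripts are all assumptions made for this example. For simplicity the sketch keeps a small term-to-offset table in memory, which a real engine of the size described above might also have to keep on disk.

```python
import json
import os
import tempfile


def build_index(transcripts, path):
    """Write per-term postings to `path`; return a term -> (offset, length) table."""
    postings = {}
    for video_id, words in transcripts.items():
        for seconds, word in words:
            postings.setdefault(word.lower(), []).append([video_id, seconds])

    lookup = {}
    with open(path, "wb") as f:
        for term in sorted(postings):
            blob = json.dumps(postings[term]).encode("utf-8")
            lookup[term] = (f.tell(), len(blob))  # where this term's hits live on disk
            f.write(blob)
    return lookup


def search(term, lookup, path):
    """Seek to the term's postings and read only those bytes from disk."""
    entry = lookup.get(term.lower())
    if entry is None:
        return []
    offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length).decode("utf-8"))


# Hypothetical recognizer output: (time in seconds, recognized word) per video.
transcripts = {
    "video_a": [(12.5, "speech"), (13.0, "recognition")],
    "video_b": [(4.0, "search"), (47.25, "speech")],
}
index_path = os.path.join(tempfile.mkdtemp(), "postings.bin")
lookup = build_index(transcripts, index_path)
hits = search("Speech", lookup, index_path)
print(hits)
```

    Each hit is a video id plus a time offset, so a client could jump straight to the spot in the video where the word was spoken.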

    A speech recognizer is a complicated system consisting of signal processing, machine learning, and fast pattern matching algorithms. In a nutshell, we use machine learning techniques to learn from millions of examples what each sound ("phoneme") of a language sounds like ("acoustic-phonetic model"); we add a dictionary that lists, for each word of the language, how it is made up of these phonemes; we also include some form of grammar ("language model"); and at recognition time a complicated process ("Viterbi algorithm") matches incoming audio against these models and outputs the most likely match.
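
    To give a feel for the matching step, here is a minimal sketch of the Viterbi algorithm on a toy hidden Markov model. It is nowhere near a real recognizer: the two "phoneme" states, the two acoustic symbols, and all probabilities below are made up for illustration, and a real system works on continuous audio features with far larger models.

```python
import math


def viterbi(observations, states, start, trans, emit):
    """Return (log-probability, state path) of the most likely state sequence."""
    # best[s]: log-probability of the best path that ends in state s so far.
    best = {s: math.log(start[s]) + math.log(emit[s][observations[0]]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Pick the predecessor state that gives the best score for reaching s.
            prev = max(states, key=lambda p: best[p] + math.log(trans[p][s]))
            new_best[s] = best[prev] + math.log(trans[prev][s]) + math.log(emit[s][obs])
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Toy model (all values are assumptions for this example).
states = ["ah", "s"]
start = {"ah": 0.6, "s": 0.4}
trans = {"ah": {"ah": 0.7, "s": 0.3}, "s": {"ah": 0.4, "s": 0.6}}
emit = {"ah": {"voiced": 0.9, "hiss": 0.1}, "s": {"voiced": 0.2, "hiss": 0.8}}

score, path = viterbi(["voiced", "voiced", "hiss"], states, start, trans, emit)
print(path)
```

    Working in log-probabilities keeps the scores from underflowing on long inputs, which matters once the observation sequence is thousands of audio frames rather than three symbols.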

    If you are a programmer and interested in trying to use speech recognition in programs, please download the Microsoft SAPI SDK. If you are interested in the scientific aspect of speech recognition, how it works inside, you can go to the HTK web site of Cambridge University, England, where you can download code and an excellent tutorial.

    Regarding the many applications you describe, pretty much all of them are already being worked on in research labs around the world. The furthest out is the AI component. The current hard problems being looked at are robustness of speech recognizers to background noise, accents, speaking styles, etc.

    Hey, thanks for your comments, and I am happy you liked the video!

    Frank Seide
