Hi Jason, let me answer a few of your questions.

First, yes, we use an index during the search, like a web search engine. The actual search only takes a second. The long delay in the video came from sending the search results from the server to the TV client in a rather inefficient way (as researchers we often resort to rapid-prototyping techniques that are fast to implement but not always fast in execution).

It would be impossible to perform all the signal and speech processing for 2500 hours of audio in such a short time. That processing takes many days on a server farm with tens of computers, which is why it is done ahead of time to build the index.

The memory footprint of the searcher is very small. A real web video search engine would include so much video that the index would be too big to fit into memory. Thus the search function reads all index data directly from disk during searching.
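To make the disk-based lookup a bit more concrete, here is a minimal Python sketch of the general idea. It is not our actual code: the file layout, the function names (build_index, lookup, search) and the small in-memory table of offsets are all assumptions made for illustration. The point is only that a query reads a few bytes per word from disk instead of loading the whole index.

```python
import os
import struct
import tempfile

def build_index(postings, path):
    """Write each term's sorted document IDs to disk; return term -> (offset, count)."""
    directory = {}
    with open(path, "wb") as f:
        for term, doc_ids in postings.items():
            ids = sorted(doc_ids)
            directory[term] = (f.tell(), len(ids))
            # Each document ID is stored as a 4-byte little-endian unsigned int.
            f.write(struct.pack(f"<{len(ids)}I", *ids))
    return directory

def lookup(term, directory, path):
    """Seek to one term's postings and read only those bytes from disk."""
    if term not in directory:
        return []
    offset, count = directory[term]
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))

def search(query, directory, path):
    """Return the documents that contain every word of the query."""
    result = None
    for term in query.lower().split():
        ids = set(lookup(term, directory, path))
        result = ids if result is None else result & ids
    return sorted(result or [])

if __name__ == "__main__":
    # Hypothetical toy data: word -> IDs of the videos in which it was recognized.
    index_path = os.path.join(tempfile.mkdtemp(), "index.bin")
    table = build_index({"speech": [3, 1], "video": [1, 2]}, index_path)
    print(search("speech video", table, index_path))
```

In this sketch only the small offset table stays in memory; the postings themselves are touched only when a query asks for them.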

A speech recognizer is a complicated system consisting of signal processing, machine learning, and fast pattern matching algorithms. In a nutshell, we use machine learning techniques to learn from millions of examples what each sound ("phoneme") of a language sounds like ("acoustic-phonetic model"). We add a dictionary that lists, for each word of the language, how it is made up of these phonemes, and we also include some form of grammar ("language model"). At recognition time a complicated process ("Viterbi algorithm") matches incoming audio against these models and outputs the most likely match.
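If you want to see the matching step in code, here is a toy Python sketch of the Viterbi algorithm on a tiny hidden Markov model. Everything in it is made up for illustration: the states "a" and "b" stand in for phonemes, the symbols "x" and "y" stand in for acoustic observations, and the probability tables are invented. A real recognizer works on continuous audio features with vastly larger models, so please read this as a sketch of the principle only.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state sequence, its log probability) for a toy HMM.

    All probabilities must be non-zero, since the search runs in the log domain.
    """
    # best[t][s]: log probability of the best path that ends in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    # back[t][s]: the predecessor state on that best path.
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Pick the best final state, then follow the back-pointers to the start.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

if __name__ == "__main__":
    # Hypothetical two-state model with invented probabilities.
    states = ["a", "b"]
    start_p = {"a": 0.6, "b": 0.4}
    trans_p = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
    emit_p = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}}
    print(viterbi(["x", "x", "y"], states, start_p, trans_p, emit_p))
```

The trick is that at each time step only the best path into each state is kept, so the work grows linearly with the length of the audio rather than exponentially.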

If you are a programmer and interested in trying speech recognition in your own programs, please download the Microsoft SAPI SDK. If you are interested in the scientific aspect of speech recognition and how it works inside, you can go to the HTK web site of Cambridge University, England, where you can download code and an excellent tutorial.

Regarding the many applications you describe, pretty much all of them are already being worked on in research labs around the world. The furthest out is the AI component. The hard problems currently being looked at are the robustness of speech recognizers to background noise, accents, speaking styles, etc.

Hey, thanks for your comments, and I am happy you liked the video!

Frank Seide