MAVIS uses vocabulary search: it knows about the actual words and the context they're used in, even though under the hood it works with senones, chunks of speech even smaller than phonemes. It stores multiple possible recognitions, complete with the probability that each one is right (did you say "speech" or "beach" or even "peach"?). And if it comes across words it doesn't know, it looks them up on Bing.
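The idea of keeping several candidate transcriptions, each with a confidence score, can be sketched in a few lines of Python. This is purely illustrative: the class and function names are invented, not the actual MAVIS data structures.

```python
# Hypothetical sketch of storing multiple candidate recognitions
# with confidence scores; names are illustrative, not MAVIS APIs.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    probability: float  # likelihood this transcription is correct

def best_matches(hypotheses, query, threshold=0.1):
    """Return candidate transcriptions containing the query word,
    keeping lower-probability alternatives so search still finds them."""
    return [h for h in hypotheses
            if query in h.text.split() and h.probability >= threshold]

segment = [
    Hypothesis("recognize speech", 0.70),
    Hypothesis("recognize beach", 0.20),
    Hypothesis("recognize peach", 0.10),
]

# Searching for "beach" still finds this audio segment, via the
# second-ranked hypothesis rather than the top one.
print([h.text for h in best_matches(segment, "beach")])
```

Because the alternatives are kept rather than discarded, a search for a word the recognizer ranked second still leads back to the right moment in the audio.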
That all makes it easier to tell the difference between someone talking about history and saying 'the Crimean war' and someone talking about politics and mentioning 'crime in a war', explained Behrooz Chitsaz, the Microsoft Research director who talked about MAVIS at the MIX conference back in 2010.
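The Crimean example comes down to a language model scoring word sequences in context. A toy sketch with invented bigram probabilities (these numbers are made up for illustration, not taken from MAVIS) shows the mechanism:

```python
# Toy sketch of how a language model prefers "the Crimean war" over
# "crime in a war"; the log-probabilities below are invented.
BIGRAM_LOGP = {
    ("the", "crimean"): -2.0, ("crimean", "war"): -1.0,
    ("crime", "in"): -3.0, ("in", "a"): -1.5, ("a", "war"): -4.0,
}

def score(words, unseen_logp=-8.0):
    """Sum bigram log-probabilities; unseen word pairs are penalized."""
    return sum(BIGRAM_LOGP.get(pair, unseen_logp)
               for pair in zip(words, words[1:]))

history = "the crimean war".split()
politics = "crime in a war".split()
print(score(history) > score(politics))  # the history reading wins here
```

In a real recognizer the acoustic scores and the language-model scores are combined, so context can rescue a phrase the raw audio left ambiguous.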
MAVIS is also behind the Department of Energy's ScienceCinema search and it's been searching video for the British Library, NASA, and state archives in both Georgia and Washington. Wondering why a particular bill was passed? How about jumping straight to the exchange in the debate that tipped the balance...
Those systems were built using the MAVIS APIs for SharePoint and SQL Server, which let you build up a multimedia archive on your intranet and use MAVIS to index it.
MAVIS has always used Azure to run its speech recognition; that's how Green Button (which Microsoft bought recently) was able to build a cloud video indexing service on it. A couple of years ago MAVIS switched to the same type of deep learning neural networks that are behind the live speech translation Microsoft plans to launch for Skype later this year. Deep learning is being used at Google and Facebook and it's the hot new technique in machine learning, so it's notable that Microsoft was the first company to put deep learning-driven voice recognition into a product, back in 2012, even if it was one you needed to be a research lab or a state to afford.
The combination of the massive data centers you need to run a cloud service and the economics of SaaS subscriptions means that kind of high-powered tool is becoming far more widely available. With the Azure Media Indexer, you can skip building your own media archive completely and put it in the cloud instead.
When did I say that?
The voice recognition in the Indexer isn't only useful for videos; it works just as well for searching your voice mail or recordings of meetings and conference calls. You could use it to jump to the point in a match where someone scores a goal, to the bit in the meeting where someone says something useful, or to find the call from the auto repair shop you're waiting for without listening to all your other voice mail first.
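That kind of "jump straight to the moment" search boils down to a keyword lookup over a time-stamped transcript. A minimal sketch, assuming an invented transcript format (not the Azure Media Indexer's actual output files), might look like this:

```python
# Minimal sketch of keyword-to-timestamp lookup over a transcript;
# the (start_seconds, text) format here is invented for illustration.
def find_offsets(transcript, keyword):
    """Return start times (in seconds) of segments mentioning the
    keyword, so a player can seek straight to them."""
    keyword = keyword.lower()
    return [start for start, text in transcript
            if keyword in text.lower()]

voicemail = [
    (0.0,  "Hi, this is the dentist's office confirming Tuesday"),
    (42.5, "Hello, it's the auto repair shop, your car is ready"),
    (90.0, "Reminder about the team meeting on Friday"),
]

print(find_offsets(voicemail, "repair"))  # seek to 42.5 seconds
```

Feed the returned offset to a media player's seek control and you skip every message that isn't the one you wanted.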