
Microsoft brings video voice recognition to everyone

Mary Branscombe | Sept. 12, 2014
Azure Media Services is something Apple might want to consider for streaming its next keynote, rather than rolling its own system on Amazon Web Services and Akamai. It's what big-name broadcasters used to stream the 2014 Winter Olympics and the 2014 World Cup, it's what powers the Blinkbox streaming video service, and if you watched the Xbox One announcement you've already used it, so it's certainly proved its reliability.

MAVIS uses vocabulary search, meaning it knows about the actual words and the context they're used in, although under the hood it works on senones, chunks of speech even smaller than phonemes. It stores multiple possible recognitions, complete with the probability that each one is right (did you say "speech" or "beach" or even "peach"?). And if it comes across words it doesn't know, it looks them up on Bing.
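
To make that concrete, here is a minimal sketch, not MAVIS's actual data structures but the general idea of an n-best word index, showing how a search over audio can keep several competing recognitions per segment, each with its own probability, so lower-confidence alternatives stay findable:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class WordHypothesis:
    word: str           # one possible recognition of an audio segment
    start_sec: float    # where the segment begins in the recording
    probability: float  # confidence that this is the right word

# Toy lattice: each audio segment keeps several competing hypotheses,
# not just the single best guess ("speech" vs. "beach" vs. "peach").
segments = [
    [WordHypothesis("speech", 12.4, 0.71),
     WordHypothesis("beach", 12.4, 0.21),
     WordHypothesis("peach", 12.4, 0.08)],
    [WordHypothesis("recognition", 13.0, 0.95)],
]

# Invert the lattice into a word -> occurrences index so a text query
# can land on every plausible match, ranked by probability.
index = defaultdict(list)
for segment in segments:
    for hyp in segment:
        index[hyp.word].append(hyp)

def search(term, min_probability=0.05):
    hits = [h for h in index.get(term.lower(), []) if h.probability >= min_probability]
    return sorted(hits, key=lambda h: h.probability, reverse=True)

print(search("beach"))  # still findable, just with a lower confidence score
```

Keeping the alternatives around is what lets a later text query still surface a segment even when the top recognition guess was wrong.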

That all makes it easier to tell the difference between someone talking about history and saying 'the Crimean war' and someone talking about politics and mentioning 'crime in a war', explained Behrooz Chitsaz, the Microsoft Research director who talked about MAVIS at the MIX conference back in 2010.
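
The disambiguation Chitsaz describes is essentially what a language model does when it rescores candidate transcriptions. The toy example below, with made-up bigram probabilities rather than real corpus counts, shows how context can make one candidate score higher than another that sounds almost the same:

```python
import math

# Illustrative bigram probabilities only: in a history document,
# "crimean war" is far more likely than "crime in".
bigram_logprob = {
    ("the", "crimean"): math.log(0.020),
    ("crimean", "war"): math.log(0.300),
    ("the", "crime"):   math.log(0.010),
    ("crime", "in"):    math.log(0.005),
    ("in", "a"):        math.log(0.050),
    ("a", "war"):       math.log(0.020),
}

def sentence_score(words):
    """Sum bigram log-probabilities; unseen pairs get a heavy penalty."""
    return sum(bigram_logprob.get(pair, math.log(1e-6))
               for pair in zip(words, words[1:]))

candidates = [["the", "crimean", "war"], ["the", "crime", "in", "a", "war"]]
print(max(candidates, key=sentence_score))  # context favours "the crimean war"
```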

MAVIS is also behind the Department of Energy's ScienceCinema search and it's been searching video for the British Library, NASA, and state archives in both Georgia and Washington. Wondering why a particular bill was passed? How about jumping straight to the exchange in the debate that tipped the balance...

Those systems were built using the MAVIS APIs for SharePoint and SQL Server, which let you build up a multimedia archive on your intranet and use MAVIS to index it.

MAVIS has always used Azure to run its speech recognition; that's how Green Button (which Microsoft bought recently) was able to build a cloud video indexing service on top of it. A couple of years ago MAVIS switched to the same type of deep learning neural networks that are behind the live speech translation Microsoft plans to launch for Skype later this year. Deep learning is being used at Google and Facebook and it's the hot new technique in machine learning, so it's interesting that Microsoft was the first company to put deep learning-driven voice recognition in a product back in 2012, even if it was one you needed to be a research lab or a state to afford.

The combination of the massive data centers you need to run a cloud service and the economics of SaaS subscriptions means that kind of high-powered tool is becoming far more widely available. With the Azure Media Indexer, you can skip building your own media archive completely and put it in the cloud instead.
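
The workflow boils down to: upload your media as an asset, run an indexing job over it, then pull down the transcript and keyword files the job produces. The sketch below only illustrates that shape; the endpoint URLs, field names and authentication are placeholders rather than the real Azure Media Services REST surface, so treat it as Python-flavoured pseudocode and consult the actual SDK or REST reference for working calls:

```python
import requests

# Placeholder values: the real Azure Media Services endpoints, auth flow and
# request bodies differ; this only illustrates the overall indexing workflow.
BASE_URL = "https://media.example.net/api"          # hypothetical endpoint
TOKEN = "<bearer token obtained from Azure AD>"      # auth is hand-waved here
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

def submit_indexing_job(asset_id: str) -> str:
    """Create a job that runs the media indexer over an uploaded asset and
    return the job id so its output (captions, keywords) can be polled."""
    body = {
        "Name": "index-recording",
        "InputAssetId": asset_id,
        "Tasks": [{
            "MediaProcessor": "Azure Media Indexer",  # processor named in the article
            "Configuration": "<index everything>",     # placeholder configuration
        }],
    }
    response = requests.post(f"{BASE_URL}/jobs", json=body, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()["Id"]

def fetch_transcript(job_id: str) -> str:
    """Once the job finishes, download the time-coded transcript it produced."""
    response = requests.get(f"{BASE_URL}/jobs/{job_id}/output/transcript",
                            headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text
```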

When did I say that?

The voice recognition in the Indexer is not only useful for videos; it would also work just as well for searching your voice mail or recordings of meetings and conference calls. You could use it to jump to the point in a match where someone scores a goal, or to the bit in the meeting where someone says something useful, or to check your voice mail for the call from the auto repair shop that you're waiting for without having to listen to all your other messages first.
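
All of those scenarios reduce to the same operation: search a time-coded transcript for a term and jump playback to the offsets where it occurs. Here is a small self-contained sketch, using an invented toy transcript rather than real indexer output:

```python
from datetime import timedelta

# Toy time-coded transcript, in the spirit of the caption/keyword files an
# indexer produces: (start seconds, end seconds, recognised text).
transcript = [
    (0.0,   6.5, "hi this is the auto repair shop calling about your car"),
    (42.0, 48.0, "the parts arrived and it will be ready on thursday"),
    (95.0, 99.0, "please call us back to confirm pickup"),
]

def find_mentions(term: str):
    """Return the offsets where a word is spoken, so playback can jump there."""
    term = term.lower()
    return [(timedelta(seconds=start), text)
            for start, end, text in transcript
            if term in text.lower()]

for offset, text in find_mentions("repair"):
    print(f"jump to {offset}: {text}")
```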

 
