That, of course, was just the baseline. As anyone who’s used dictation software can tell you, the key to accuracy is training. Over time, a voice dictation program learns your accent (whether you pronounce the “a” in “apricot” like the one in “bad” or in “ape”) and how to filter out your unconscious verbal tics. I’ve seen Microsoft employees claim that, properly trained, Windows’ speech recognition was 99% accurate. Ten mistakes or so per 1,000 words isn’t bad at all.
Very few of us, though, probably want to spend the time training the software. Windows Speech Recognition requires up to 10 minutes to run through a few practice sentences, and it feels like a lifetime. Cortana and Siri don’t require any of the same setup time, as they’ve already been trained on millions of voice samples. There’s something to be said for instant gratification.
Training speech within Windows is a lengthy process. The setup time associated with Nuance’s Dragon software is far shorter, perhaps a minute or so. But modern digital assistants recognize your words instantly.
What makes Cortana (which you can use on your PC or phone) so much better than Windows’ own ancient voice dictation systems is her link to the massive computational power of the Microsoft cloud. Microsoft can crunch and correlate your voice input together with whatever other data Microsoft knows about you, generating the intelligence that is the soul of Cortana.
Microsoft talks up speech recognition
Given Cortana’s proven skills, you’d think speech would have taken center stage at Microsoft’s Ignite show last week. But Ignite contained exactly zero sessions on voice dictation and apparently just one on speech recognition. Yet CEO Satya Nadella’s keynote address painted speech recognition as a critical component of Microsoft’s future.
Take Skype Translator, for example. Microsoft’s Star Trek-like universal translator depends upon three different strands of research, according to Nadella: speech recognition, speech synthesis, and machine translation. “So you take those three technologies, apply deep reinforced learning and neural nets and the Skype data and magic happens,” he said.
“Even inside of Word or Outlook when you’re writing a document we now don’t have simple thesaurus-based spell correction,” Nadella continued, noting that Office can now even compensate for dyslexia. “We have complete computational linguistic understanding of what you’re building. Or what you’re writing.”
But not what you’re saying, apparently.
Microsoft chief executive Satya Nadella stands next to NFL star Deion Sanders at Microsoft’s Ignite conference. Has Microsoft fumbled its dictation opportunity?
During the same speech, Nadella bragged that Microsoft’s speech algorithms achieved a word error rate of 6.9 percent on the NIST Switchboard test. That may sound bad: it translates to accuracy of about 93.1 percent. But the Switchboard test uses sample rates of just 8kHz, about the quality of a telephone conversation in the year 2000. Windows Media Audio 10, the codec within OneNote, can capture audio at up to 48kHz, providing much higher-fidelity samples.
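For readers unfamiliar with the metric: word error rate is the word-level edit distance (substitutions, insertions, and deletions) between what was said and what was transcribed, divided by the number of words actually spoken. The sketch below shows the standard dynamic-programming computation; the sample sentences are made-up illustrations, not NIST Switchboard data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A 6.9% WER means roughly 69 word errors per 1,000 spoken words,
# i.e. about 93.1% of words transcribed correctly.
print(word_error_rate("recognize speech with a microphone",
                      "wreck a nice speech with a microphone"))
```

Note that WER can exceed 100 percent when the recognizer inserts more words than were spoken, which is why it is reported as an error rate rather than a simple accuracy figure.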