Recognition and context
The first of these steps consists of converting your spoken words into text--a task that Apple reportedly delegates to voice recognition powerhouse Nuance. Siri does a remarkable job here: Even with my Italian accent--"Noticeable," as a friend once told me, "but you'll never be picked as the next voice of Mario"--I find myself only rarely having to repeat a command.
However, Siri's success at understanding me is possible only because it already "knows" the words I am likely to speak: The service uploads your contacts and other data about you so that it can recognize the information later on with a good degree of accuracy. Apple has programmed Siri to understand all the terms that are required to fulfill the tasks it supports, based on the context in which they are presented.
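To make the idea of "knowing" your contacts concrete, here is a minimal sketch of how a recognizer's rough guess might be snapped to a known name. It is an illustration, not Apple's or Nuance's method: the function name match_contact, the sample contact list, and the 0.6 cutoff are all assumptions, and the comparison works on spelling with Python's standard difflib module rather than on sound.

```python
import difflib

def match_contact(heard, contacts, cutoff=0.6):
    """Return the known contact closest to the transcribed text, or None.

    Hypothetical helper: compares spelling only, whereas a real speech
    recognizer would weigh acoustic evidence as well.
    """
    # Compare in lowercase, but remember the original spelling of each name.
    lowered = {name.lower(): name for name in contacts}
    hits = difflib.get_close_matches(heard.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# Made-up address book, for illustration only.
contacts = ["Maria Rossi", "John Smith", "Luca Bianchi"]
best = match_contact("maria rosi", contacts)
```

With a short list of likely names to compare against, a slightly garbled transcription can still land on the right person, which is the advantage the uploaded data provides.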
Due to the vagaries of human languages, this is not a simple problem to solve even with the most advanced technology. For example, the words "byte" and "bite" sound exactly the same, but a restaurant-review app is more likely to use the latter, while software destined for a technical audience will more often employ the former. Confusing the two could lead to a dead-wrong interpretation of the resulting text: Nobody wants a few chips of RAM with their dark rye sandwich, but a computer has no concept of the absurd.
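The byte-versus-bite choice can be sketched as a simple scoring exercise: count how many of the surrounding words belong to each candidate's typical vocabulary and pick the winner. The word lists and the function name pick_homophone below are invented for illustration; a production recognizer would rely on statistical language models trained on large amounts of text, not on hand-written sets.

```python
# Hypothetical context vocabularies, written by hand for this example.
CONTEXT_WORDS = {
    "byte": {"ram", "memory", "file", "disk", "computer", "data"},
    "bite": {"sandwich", "lunch", "restaurant", "eat", "food", "rye"},
}

def pick_homophone(candidates, context):
    """Return the candidate whose vocabulary overlaps most with the context.

    Ties go to the candidate listed first.
    """
    # Normalize the surrounding words: strip punctuation and lowercase.
    words = {w.strip(".,!?").lower() for w in context.split()}
    scores = {c: len(CONTEXT_WORDS.get(c, set()) & words) for c in candidates}
    return max(candidates, key=lambda c: scores[c])

choice = pick_homophone(["byte", "bite"], "I grabbed a quick one of my rye sandwich at lunch")
```

Even this toy version shows why context matters: the same sound resolves to a different word depending on whether the neighbors are "rye" and "sandwich" or "disk" and "memory."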
In order to allow third parties to take advantage of Siri, Apple would have to figure out a way for developers to "teach" the service about the specific terminology that their software is going to use, and the context in which it is going to be used. As you can imagine, this would be difficult even for simple apps, and nearly impossible for others, particularly if they deal with complex concepts that do not lend themselves well to vocalization.
From words to concepts
Once voice is turned into text, Siri's next job consists of understanding what the user is asking for, a process that relies on an area of science called natural language processing. If you thought voice recognition was difficult, this is many, many times harder, because humans have a nearly unlimited ability to express any given concept using endless combinations of words, and they often say one thing when they really mean another.
To tackle this problem, a natural language system like Siri usually starts by attempting to parse the syntactic structure of a piece of text, extracting things like nouns, adjectives, and verbs, as well as the general tone of the sentences. That helps Siri determine, for example, whether the text is a question, or whether the person is phrasing things in a way that suggests they are upset or excited.
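As a rough illustration of that first pass, the sketch below splits a sentence into words and applies two crude rules: one to guess whether it is a question, another to guess whether it is emphatic. Everything here is an assumption made for the example, including the function name classify_sentence and the list of question words; real systems use trained parsers and part-of-speech taggers rather than a handful of rules.

```python
import re

# Hypothetical list of words that often open an English question.
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "is", "are", "do", "does", "can", "could", "will", "would", "should",
}

def classify_sentence(text):
    """Tokenize the text and make a crude guess at its type."""
    stripped = text.strip()
    # Lowercase words only; punctuation and digits are dropped.
    tokens = re.findall(r"[a-z']+", stripped.lower())
    # Transcribed speech may lack punctuation, so also check the first word.
    starts_like_question = bool(tokens) and tokens[0] in QUESTION_WORDS
    return {
        "tokens": tokens,
        "question": stripped.endswith("?") or starts_like_question,
        "emphatic": stripped.endswith("!"),
    }

result = classify_sentence("What is the weather like today")
```

Rules this simple break down quickly ("Is that so" and "Is that so!" mean very different things), which hints at why the real problem is so much harder.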