For example, they take input like the waveform below and transform it into the words "I want pizza!"
The result of this process is just a string of words. To make use of those words, the systems have to reason about what they mean, what you might want, and how they can help you get what you need. In this instance, that starts with a tiny bit of natural language processing (NLP).
Again, each of these systems has its own take on the problem, but all of them do very similar things with NLP. In this example, they might note the use of the term "pizza," which is marked as being food, see that there is no term such as "recipe" that would indicate that the speaker wanted to know how to make the pizza, and decide that the speaker is looking for a restaurant that serves pizza.
This is fairly lightweight language processing driven by simple definitions and relationships, but the end result is that these systems now know that the speaker wants a pizza restaurant or, more precisely, can infer that the speaker wants to know where he or she can find one.
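The keyword reasoning described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular assistant is built: the word lists, the intent names and the `infer_intent` function are all assumptions made up for the example.

```python
import re

# Hypothetical vocabularies; a real system would use a far larger lexicon.
FOOD_TERMS = {"pizza", "sushi", "burger"}
RECIPE_CUES = {"recipe", "make", "cook", "bake"}


def infer_intent(utterance: str) -> dict:
    """Guess what the speaker wants from the words alone."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    foods = sorted(words & FOOD_TERMS)
    if not foods:
        # No term marked as food, so this sketch gives up.
        return {"intent": "unknown", "food": None}
    if words & RECIPE_CUES:
        # A cue such as "recipe" suggests the speaker wants to make the dish.
        return {"intent": "find_recipe", "food": foods[0]}
    # A food term with no recipe cue: assume the speaker wants a restaurant.
    return {"intent": "find_restaurant", "food": foods[0]}


print(infer_intent("I want pizza!"))
```

With the word lists above, "I want pizza!" maps to a restaurant search, while "How do I make pizza?" maps to a recipe search because of the word "make."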
This transition from sound, to words, to ideas, to actual user needs gives these systems what they require to plan how to satisfy those needs. In this case, the system grabs GPS info, looks up restaurants that serve pizza and ranks them by proximity, rating or price. Or, if you have a history, it may suggest a place that you already seem to like.
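As a rough sketch of the ranking step, the Python below sorts candidate restaurants by proximity, rating or price. The `Restaurant` fields, the `rank` function and the sample places are hypothetical; distance is computed with the standard haversine formula, and the user-history case is left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Restaurant:
    name: str
    lat: float
    lon: float
    rating: float  # assumed scale: 0 to 5 stars
    price: int     # assumed scale: 1 (cheap) to 4 (expensive)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def rank(places, user_lat, user_lon, by="proximity"):
    """Return places sorted best-first by the chosen criterion."""
    keys = {
        "proximity": lambda p: distance_km(user_lat, user_lon, p.lat, p.lon),
        "rating": lambda p: -p.rating,  # higher rating first
        "price": lambda p: p.price,     # cheaper first
    }
    return sorted(places, key=keys[by])


# Made-up sample data, purely for illustration.
places = [
    Restaurant("A", 0.0, 0.01, rating=3.5, price=2),
    Restaurant("B", 0.0, 0.05, rating=4.8, price=3),
    Restaurant("C", 0.0, 0.03, rating=4.0, price=1),
]
print([p.name for p in rank(places, 0.0, 0.0, by="proximity")])
```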
Once all of this is done, it is a matter of organizing the results into a sentence or two, a process called natural language generation, or NLG. These words are then turned into sounds (speech generation).
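One simple way to do NLG at this scale is to fill in sentence templates. The sketch below assumes that approach; the `describe_results` function, its wording and the restaurant names are invented for illustration, and real systems may generate text quite differently.

```python
def describe_results(food, names):
    """Turn a list of restaurant names into one spoken-style sentence."""
    if not names:
        return f"Sorry, I couldn't find any {food} places nearby."
    if len(names) == 1:
        return f"I found one {food} place nearby: {names[0]}."
    # Join all but the last name with commas, then add "and" before the last.
    listed = ", ".join(names[:-1]) + " and " + names[-1]
    return f"I found {len(names)} {food} places nearby: {listed}."


# Hypothetical restaurant names.
print(describe_results("pizza", ["Luigi's", "Slice House"]))
```

The resulting string is what would then be handed to speech generation.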
Broad AI, narrow AI
The interesting thing about these systems is their mix of broad and narrow approaches to AI. Their input and output -- speech recognition and generation -- are fairly general, so they are all pretty good at hearing what you say and giving voice to the results.
On the other hand, each of these systems has a fairly narrow set of tasks it can perform, and the actual reasoning it does is deciding which of those tasks (find a restaurant or find a recipe) to carry out. The tasks themselves tend to be search- or query-oriented, sending requests for information to different sources with different queries based on text elements grabbed from the speech. So the real smarts inside these systems lie in answering the question, "What do you want me to do?" by identifying the terms that indicate your wishes.