Voice is widely expected to be the next frontier in how people interact with machines. However, talking to a machine has never come easily to humans, and accepting it as an integral part of everyday communication is harder still. So, what will make voice interaction compelling enough for people to integrate it into their daily lives?
For the sake of discussion, let us break voice-based interaction into two simple types. The first type is about giving instructions to machines to perform tasks (e.g., "Call Adam" or "Set up a meeting with Clark at 5 p.m."). Unlike traditional GUI-based interactions, voice-based interactions remove the need to navigate through screens and UI controls (action buttons, etc.) to start a task.
The recently launched Microsoft Xbox One showed how compelling user experiences can be delivered by combining voice and gesture-based interaction across gaming, TV viewing and talking to friends on Skype. Such voice interactions rely heavily on the system’s ability to recognize spoken commands with the same degree of reliability that people have come to expect from traditional GUI-based interactions.
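To make the first type of interaction concrete, here is a minimal sketch of how a transcribed instruction such as "Call Adam" might be mapped to a task. It assumes speech has already been converted to text; the `Command` type, the intent names and the patterns are hypothetical and chosen only for illustration. Real assistants use far more robust language understanding than regular expressions.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    """A recognized instruction: what to do and with which details."""
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical patterns covering only the two example instructions.
PATTERNS = [
    ("call", re.compile(r"^call (?P<contact>.+)$", re.IGNORECASE)),
    (
        "schedule_meeting",
        re.compile(
            r"^set ?up a meeting with (?P<attendee>.+?) at (?P<time>.+)$",
            re.IGNORECASE,
        ),
    ),
]


def parse_command(transcript: str) -> Optional[Command]:
    """Return the first matching Command, or None if nothing matches."""
    text = transcript.strip()
    for intent, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return Command(intent, match.groupdict())
    return None


if __name__ == "__main__":
    print(parse_command("Call Adam"))
    print(parse_command("Set up a meeting with Clark at 5 p.m."))
```

The point of the sketch is that the spoken sentence itself carries both the task and its parameters, so no navigation step is needed. An utterance that matches no pattern returns `None`, which is where a real system would ask the user to rephrase.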
The second type is really the next level: it is not just about instructing machines to do a task, but also about seeking information, such as what the capital of Norway is or where to find a good Italian breakfast in New Jersey. Players in this space include Apple and Google, who are continuously learning through their experiments. We will begin to see new services that listen in the background and, genie-like, grant wishes for almost anything.
The improved Siri is positioned as a human-like assistant with witty, faux-human qualities. Google Now, with its entry into the iOS space, is positioned as a lightning-fast, geeky expert that predicts your queries. Together they are opening up a new battleground in the smartphone market. Success with these voice-based experiences will depend not only on great voice recognition and interpretation capabilities, but also on the ability to respond to users with intelligent and meaningful answers by being contextually aware. This means that such services would rely on understanding the personal, physical, temporal and social aspects of their users.
Knowledge of a user's context is spread across the many apps and services they use: booking flight tickets for a family vacation in an airline app, renting a car in another app, setting a reminder to leave the office early, keeping boarding passes in Google Wallet, and so on. The next level of intelligence in voice command-based systems could come from integrating the data from these apps that resides on an individual device, eventually leading to more effective and responsive voice-based interactions.
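As a rough illustration of that integration, the sketch below pools records from several apps into one list and answers a spoken question from it. Everything here is an assumption made for the example: the `ContextItem` fields, the app names, the sample data and the naive keyword matching are hypothetical, and no real app exposes its data in this form.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ContextItem:
    source: str      # app that produced the record (hypothetical names)
    kind: str        # e.g. "flight", "car_rental", "reminder"
    when: datetime   # when the event takes place
    summary: str     # short human-readable description


def next_item(items: List[ContextItem], kind: str,
              now: datetime) -> Optional[ContextItem]:
    """Return the earliest upcoming item of the given kind, if any."""
    upcoming = [i for i in items if i.kind == kind and i.when >= now]
    return min(upcoming, key=lambda i: i.when, default=None)


def answer(query: str, items: List[ContextItem], now: datetime) -> str:
    """Answer a transcribed question using the pooled context."""
    # Deliberately naive keyword lookup; a stand-in for real interpretation.
    keywords = {"flight": "flight", "car": "car_rental", "reminder": "reminder"}
    for word, kind in keywords.items():
        if word in query.lower():
            item = next_item(items, kind, now)
            if item is None:
                return f"No upcoming {word} found."
            return f"{item.summary} on {item.when:%Y-%m-%d %H:%M} (from {item.source})"
    return "Sorry, I have no context for that."


if __name__ == "__main__":
    # Made-up sample data standing in for what each app might hold.
    context = [
        ContextItem("airline_app", "flight", datetime(2024, 6, 1, 9, 30),
                    "Family vacation flight"),
        ContextItem("rental_app", "car_rental", datetime(2024, 6, 1, 13, 0),
                    "Car pickup"),
    ]
    print(answer("When is my flight?", context, datetime(2024, 5, 30, 8, 0)))
```

The design choice worth noting is the single shared record shape: once each app's data is expressed in a common form, one spoken question can draw on all of them, which is the contextual awareness described above.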