The new year is a time for reflection and looking to the future. As my team at VERSA circled up to discuss the state and future of technology, three themes emerged quickly. We see the 2020s as roaring years in which voice technology disrupts, matures, and integrates.

Voice as a platform, multimodal experiences, and voice search engine optimization (SEO) are three of the top themes we’re tracking. In this article, I explore these themes, why they’re emerging now, and their impact on human-computer interaction in the roaring ’20s and beyond.

Will the future be enabled by voice? Absolutely.

Famed investor Mark Cuban recently stated, “There’s no future that doesn’t have ambient computing or voice activation. None.” So that sounds like a strong “yes”! Among the many reasons voice activation is part of our future is simply user expectation. I hear stories of young children walking up to devices and speaking to them, assuming that speaking to a screen will activate it. Boomers are done with manually typing text messages, instead surreptitiously whispering into their phones at the airport and the dinner table. Users are ready to stop laboriously tapping on mobile phones or constantly searching for the TV remote under all the sofa pillows. There’s no doubt the voice revolution is well underway, and many are already asking, “Why isn’t this voice-enabled already?” In fact, today’s investments in voice technology are already driving user purchasing, adoption, and retention of hardware and software internationally.

Mobile was a disruptor – get ready for voice as a platform

When mobile hit the scene, there was initial disbelief, a slew of clunky bad apps, and then a flood of adoption and investment. The term “mobile first” indicated “a new approach to planning, UX design, and development that puts handheld devices at the forefront of both strategy and implementation.” Voice is on its way to doing exactly the same thing.

In 2020, it is difficult to remember a world without mobile apps. Similarly, there will be a time when we forget we couldn’t speak into our devices in an integrated and intuitive way. But how will this revolution in voice technology come about? How will brands maintain and grow their empires using voice experiences? How will we use voice technology for both daily routines and specialized, boutique experiences? The answers to these questions are worth billions.

smart personal assistant shaped like a cylinder

Tech giants such as Alibaba, Amazon, Apple, Facebook, Google, and Microsoft are investing heavily in voice technology, competing in both hardware and software. In 2017, Amazon had 5,000 employees working on the Alexa team; by 2019 that number had doubled to 10,000. This Amazonian army is comparable in size to Princeton’s total student body and faculty. It is no secret that Amazon’s strategy is called “Alexa everywhere”: the company sees a world where all devices are voice-enabled. Similarly, Google recently announced that its CEO is doubling down on investment in voice and the Google Assistant. With all of these resources and this investment, what will these products look and sound like in the future? What kinds of products and interfaces are these tech giants creating?

white smart speaker with icons floating above it

Voice first interactions – speak and you shall receive

If we know that voice technology is in our future, then what will it look like and how will users interact with it? My team and I believe that the future of voice will be an integrated approach more commonly known as multimodal interaction. But before we look at why multimodal is the future, let’s take a look at the way voice technology is often approached today.

“Voice first” or “voice-first” is the term for user interfaces whose primary or sole functionality is accessible through human speech (see Google patents dating back to the 1990s). The term is something of a misnomer and not the mirror image of “mobile first.” For voice first, it is not that we are building the voice capabilities before any other capabilities; it is that we are building the system exclusively for voice interactivity.

This means you speak to the Google Home Mini and it responds back to you. For voice first, the primary way we interact with the device is through human speech. Incredible advances in voice technology have made this magical experience a quotidian one for many, but not all, users. Let’s remember that speech recognition is far from perfect and text-to-speech can still sound quite robotic. Both are improving all the time, and these issues should smooth out with research, time, and data.
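At its core, a voice first turn is a loop of three stages: automatic speech recognition (ASR), intent handling, and text-to-speech (TTS). Here is a minimal sketch in Python, with hypothetical stand-in functions where a real product would call ASR, NLU, and TTS services:

```python
# Minimal sketch of a voice first interaction turn. The functions below
# are illustrative stand-ins, not a real assistant's API.

def recognize_speech(audio: bytes) -> str:
    """Stand-in for automatic speech recognition (ASR)."""
    return "what is the weather today"

def handle_intent(utterance: str) -> str:
    """Stand-in for natural language understanding and fulfillment."""
    if "weather" in utterance:
        return "It is sunny and 72 degrees."
    return "Sorry, I didn't catch that."

def synthesize(text: str) -> bytes:
    """Stand-in for text-to-speech (TTS); returns audio to play back."""
    return text.encode("utf-8")

def voice_first_turn(audio: bytes) -> bytes:
    # Speech in, speech out: no screen anywhere in the loop.
    utterance = recognize_speech(audio)
    reply = handle_intent(utterance)
    return synthesize(reply)
```

The point of the sketch is the shape of the loop: the only input is audio and the only output is audio, which is exactly what makes voice first both magical and narrow.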

In stark contrast to our current digital landscape, where screens, tapping, scrolling, swiping, and listening are the norm, voice requires a different type of active participation. Consider that we unlock our phones by pressing buttons and scroll through email with our fingers, but we ourselves remain silent. We listen to podcasts and music as passive listeners. The voice first world shakes that up: we actively speak to the computer and the computer responds. As the ecosystem of voice technology evolves, we will move quickly beyond voice first interactions with black boxes such as the Amazon Echo and Google Home to multimodal interactions as well.

Walk in the user’s shoes

Let’s step through a voice first interaction and compare it to a savvy, well-built multimodal experience to see why the latter will be preferred. In this theoretical world, my running shoes have just given up the ghost and it’s time to get a new pair. My work schedule is busy, and I’d rather buy online.

User initiates interaction: For a voice first interaction, I am in the living room at the end of a long day and ask my smart speaker device, “I’m interested in buying running shoes, what does ThredUp have?”

Computer response: My device guides me through an experience by asking a series of follow-up questions:

  • “What brands are you interested in?”
  • “What color would you like?”
  • “What size are you interested in?”
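This question-by-question flow is what conversation designers call slot filling: the assistant keeps prompting until every required slot has a value. A minimal sketch in Python (the slot names and prompts mirror the hypothetical shoe-shopping dialog above):

```python
# Slot-filling sketch: keep asking follow-up questions until all
# required slots (brand, color, size) have values.
REQUIRED_SLOTS = ["brand", "color", "size"]

PROMPTS = {
    "brand": "What brands are you interested in?",
    "color": "What color would you like?",
    "size": "What size are you interested in?",
}

def next_prompt(filled):
    """Return the next follow-up question, or None when every slot is filled."""
    for slot in REQUIRED_SLOTS:
        if slot not in filled:
            return PROMPTS[slot]
    return None  # all slots filled; ready to search the catalog
```

With no slots filled, `next_prompt({})` asks about brand; once brand, color, and size are all supplied, it returns `None` and the product search can run.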

Ideally, in this voice-enabled interaction, I’ll find the product I’m looking for, purchase it, and become a satisfied and habitual user of this product, all via voice interactions. But given the nature of this use case, that seems unlikely.

Here’s what the ThredUp website looks like today. Not only are there images of products, but also advertisements, a drop-down bar, branding, style, several layers of categories, and details about each individual product: all useful features for making this purchase, and all unavailable in a simple voice first interaction.

Popular second hand retailer, ThredUp, e-commerce page for shoes

Ultimately, can you imagine purchasing shoes without seeing them? It’s doubtful. ECommerce can be about reordering a product you already know you want more of, say your favorite soap or coffee. But shoes can be an emotional purchase with lots of identity questions swirling around them. Am I running enough? Am I ready to say goodbye to the old pair and step into these ones? Users of sites like eBay spend hours reading through details, comparing products, and maybe eventually purchasing some. Both text details and visual components are crucial to the user in many purchasing interactions. And this is where multimodal experiences come into play.

In my PhD research, participants reported feeling 24% more comfortable with multimodal than with voice first products. This may change, but it does give a barometer of where users are today. Nearly 50% of users use both touch and voice input on the new multimodal Google Nest product, reports Google Assistant’s VP of Engineering, Scott Huffman. The importance of visuals may be part of the reason why 98% of people who own an Alexa-enabled device have never purchased items through it. With a number that low, perhaps the 2% were Amazon employees trying out their own product.

Multimodal – visuals and voice integrated

If visuals are vital for users, what does that mean for the future of voice-enabled interactions? Users want multimodal interfaces. “Multimodal” describes technology that incorporates multiple forms of interaction, which might include but are not limited to screens, voice technology, augmented reality, virtual reality, touch, and gesture. Where voice first is narrow, its multimodal counterpart is broad, flexible, and adaptable. Solutions should match the use case: sometimes that will mean voice first, and often it will not. For more involved interactions, such as those in the medical device, social media, gaming, and education verticals, I believe users will strongly prefer multimodal interactions.

For a simple fact like, “Hey Siri, what is Beyoncé’s birthday?” a quick text-to-speech response might be the right fit. “Beyoncé was born September 4, 1981 and is 38 years old” is Apple’s response for that query today. Note that the iPhone gives a short audio response, shows the same text visually, and also provides an image and other info pulled from Wikipedia.

Screenshot of Apple's voice personal assistant Siri

For surfers in Australia, the Echo Show can be used to learn about the surf and tides through a multimodal interface VERSA built for Coastalwatch. By asking about the surf at Snapper Rocks, a user can learn the temperature and a range of information about high and low tides at that location.

A user asks Alexa what the surf is like at Snapper Rocks. Results appear on visual voice assistants.
Image credit VERSA Agency.

Whether the experience is voice first or multimodal, one of the most prevalent use cases today is leveraging voice technology to ask questions. We are looking for information about an artist. We are checking the status of the waves. The user is often searching for information. Yet is that information ready to be searched via voice? That brings us to our final theme.

Voice SEO – are you ready to be voice searched?

One of the major ways voice will make waves in the coming years is through voice SEO. Today, roughly 1 in 8 searches is conducted via voice query, and Google reports that 25% of its searches are by voice. Notably, Google is adding the Assistant to its Chrome browser. But are other companies ready for this type of voice search?

Ask yourself and your team:

  • Are you ready to be voice searched?
  • Are products tagged with relevant information that might be requested in a voice query?
  • Could a user purchase or complete a transaction through voice query on your site?

“No, absolutely not” is a common gut reaction. If that was yours, you’re far from alone. Most companies have long optimized their content for the needs of web and mobile delivery. However, taking the content you have prepared for your website and simply replicating it for voice is inadequate. Who wants to listen to long paragraphs of text read aloud? Wouldn’t you prefer a simple sentence that answers your question? To have a successful voice experience, content needs to be structured and managed for voice so it can provide targeted answers to questions. Sound like a hairy problem? Good news: there is a startup for that. Check out Jargon, the voice content management system company that is designing for voice from the ground up.
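One way to picture this restructuring: instead of serving a page of prose, map each question users are likely to ask to a short, speakable answer. A toy sketch in Python (the topics and answers here are invented for illustration):

```python
# Toy sketch: content structured for voice pairs each likely question
# topic with a short, speakable answer instead of a page of prose.
VOICE_ANSWERS = {
    "store hours": "We're open nine to five, Monday through Saturday.",
    "return policy": "You can return unworn items within 30 days.",
}

def answer(query):
    """Return the short answer whose topic appears in the query."""
    for topic, short_answer in VOICE_ANSWERS.items():
        if topic in query.lower():
            return short_answer
    return "Sorry, I don't have an answer for that yet."
```

However a real voice content management system does it, the principle is the same: the unit of content becomes the answer, not the page.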

Branding and marketing are also primed for disruption. What does your brand sound like, literally? How do people say your product names aloud? Puns and brands that made sense visually may not make sense aloud. How clear is your brand’s voice and tone when they are literally voice and tone? There’s a startup for that too – enter the branding and strategic marketing startup Pragmatic Digital.

Podcasts have already been working on this sound-versus-visual problem. You may have heard difficult-to-spell promotional codes or the name of a product a host wants you to find later. For example, the host may carefully repeat, “That’s P-A-R-T-O-N, Parton me,” on the podcast Dolly Parton’s America. If we want users to find content, it needs to be well indexed in our backends and part of our natural language processor’s lexicon for both automatic speech recognition and text-to-speech delivery. If you do not have the in-house talent to work on these problems, it might be time to hire a conversation designer or developer (very desirable skill sets on the job market), or to partner with voice agencies happy to help, such as VERSA.
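On the recognition side, one pragmatic tactic is an alias table that maps likely transcriptions, including predictable mishearings, back to a canonical name. A toy sketch in Python (the aliases are invented, riffing on the “Parton me” example above):

```python
# Toy alias table mapping likely ASR transcriptions (including common
# mishearings) to a canonical name, so spoken queries still resolve.
ALIASES = {
    "parton me": "Parton Me",
    "pardon me": "Parton Me",  # a predictable mishearing
}

def canonicalize(transcript):
    """Normalize a transcript, resolving known aliases to canonical names."""
    key = transcript.lower().strip()
    return ALIASES.get(key, transcript)
```

Real speech platforms offer richer tools for this (custom vocabularies, pronunciation hints), but the underlying idea is the same: anticipate how your names sound, not just how they are spelled.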

Voice and human computer interaction – here come the roaring 20’s

Voice will be an integral component of almost all future technology. Already today, nearly 50% of users use both touch and voice input when using the Google Nest, reports Google Assistant’s VP of Engineering, Scott Huffman. “Speakers can be quite limited,” said Huffman at CES 2020, and “Screens will change everything.” While it is true that smart speakers are limited, Huffman’s stance here implies that smart speakers are the default. I respectfully disagree.

It is not screens that will change everything; it is voice. This is simply the next chapter in the communications revolution – from illuminated manuscripts to the printing press, from print to digital, from screen-based digital to a multimodal, voice-enabled world. The way we process information and communicate has radically morphed. In the past 20 years alone, the digital world was radically impacted by innovations in social media, mobile technology, streaming entertainment services, wearables, and more. It is now voice that will disrupt our current norms of usage. We are now in the stage of initial disbelief and a slew of voice experiences, some good and some clunky, and I predict a flood of adoption and investment. Trust me, we will not be going back.

Ultimately, the impact that voice technology will have on the rest of the ecosystem is disruptive, multifaceted, and irrevocable.

Let’s recap the themes from this article:

  • Voice is a platform disruptor, see the parallels to mobile.
  • Multimodal (voice, screens, AR/VR, gesture, touch, etc.) is our future.
  • Voice SEO is a huge part of this revolution.

Come join the roaring 2020s with us! It is a wildly exciting time to be in voice technology.