IBM Speech Sandbox

How do you design for voice? Venturing into uncharted territory.

Wait, how does this work?

I had never used virtual reality, or honestly even heard of VR, until I worked at IBM. So the first step of this project was actually using a VR headset. One of our developers had created a sandbox (kind of like a play area) using an HTC Vive and the Watson Speech to Text service. As a user, you would start in a training area, where you learned the commands and gestures, and were then dropped into a free area to create, destroy, and move objects as you wanted.


After trying it out, we needed to see whether other users, specifically people who had never used VR before, could figure out how to use it.

"Make a dragon."

Because the VR environment is immersive and users are fully absorbed in the experience, slightly unintuitive behaviors can be extremely jarring in unpredictable ways. Therefore, it’s important to test often with real users as you are building. We conducted observational research with some guided questions.


We observed how users interacted with the controllers, what items they tried to create, and what phrases they used to complete certain tasks. We would first let them play around in the space with no instruction to see what they did, and then ask them to complete tasks such as, "How would you get rid of an object?" and "How can you move across the space?" When a user tried an action or a command that didn't work, we also asked what they had expected to happen versus what actually happened.

What did we learn?

Users prefer to direct their speech.

The first thing we learned was that users prefer to direct their speech at an action or object instead of talking into empty space. Users feel more comfortable when they are talking to someone or something, or at least have a direction in which to speak. Without audio prompts, users would stand there awkwardly and do nothing, waiting for some kind of direction. Users would also use speech commands less often if they didn't know where to point their voice or what to say. As soon as we added the laser pointer to the game, users intuitively directed their speech toward the pointer and became much more comfortable speaking.


Set expectations for commands.

At first, we allowed users to ask for anything, just to see what they wanted to do in the environment. They went in all kinds of directions, since technically there is no limit to what you can say (unlike a remote, which has a fixed set of buttons). When a user voiced an unsupported command, they would wait to see what happened and then try a similar command. Eventually, they got frustrated and gave up because nothing was working. To rein them in and steer them away from unsupported commands, we added a tutorial that told the user which aspects of the environment could be affected by voice and which commands to use. Once users went through the tutorial, they had a framework for what they could actually do in the space.
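To make the idea of a fixed command framework concrete, here is a minimal sketch of how a transcript from a speech service might be matched against a small supported vocabulary, with a hint returned when nothing matches. This is not the sandbox's actual code: the verb list, the `interpret` function, and the hint wording are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabulary: the real sandbox's command list is not given here.
SUPPORTED_VERBS = {
    "create": {"create", "make", "build"},
    "destroy": {"destroy", "delete", "remove"},
    "move": {"move", "teleport"},
}
ARTICLES = {"a", "an", "the"}


@dataclass
class Result:
    action: Optional[str]  # canonical action name, or None if unsupported
    target: str            # what the action applies to, e.g. "dragon"
    hint: str              # feedback to show or speak when nothing matched


def interpret(transcript: str) -> Result:
    """Map a spoken phrase to a supported action, or return a hint."""
    words = transcript.lower().strip(" .!?").split()
    for i, word in enumerate(words):
        for action, synonyms in SUPPORTED_VERBS.items():
            if word in synonyms:
                # Everything after the verb, minus articles, is the target.
                target = " ".join(w for w in words[i + 1:] if w not in ARTICLES)
                return Result(action, target, "")
    # Unsupported phrase: tell the user what is possible instead of staying silent.
    hint = "Try one of: " + ", ".join(sorted(SUPPORTED_VERBS))
    return Result(None, "", hint)
```

With this sketch, `interpret("Make a dragon.")` resolves to the `create` action with `dragon` as its target, while a phrase outside the vocabulary comes back with a hint listing the supported actions, which plays the same role as the tutorial: the user is never left waiting on a command that will not work.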


Ambient listening, not push to talk.

At first, we designed the experience to have a "push-to-talk" button on the controller, but users quickly forgot it was there or forgot which button it was. They would also hold the controller up to their mouth as if there were a microphone inside (which there wasn't). This indicated that if there is a button to speak, the natural inclination is to speak to that button. In our environment, you could create or destroy objects at any place and time, so we switched from a voice button to ambient listening. This also allowed users to manipulate objects with the controllers while they spoke instead of having to choose one or the other.


Once we felt comfortable with the experience, we showed it at South by Southwest (SXSW) 2017 at the IBM Experience. I ran the booth, which let me conduct more observational research and see how people interacted with the game in a conference setting. It got great reviews and positive press, but we also gained some important insights. First, it was not built to work with international accents. We had to watch awkwardly as some visitors struggled to get the system to understand their commands, and explain to them that this was our first attempt and that it wasn't perfect yet. We also learned that if this experience is to be used at conferences, it needs to get better at hearing the user in a noisy crowd. Sometimes users had to repeat their commands multiple times before Watson heard them, and we had to gently encourage them to try again until it worked.

The project was picked up in May 2017 to be a part of Star Trek: Bridge Crew, a VR game created by Ubisoft. The game already allowed you to talk to other real humans in the game, but because of our technology, it will soon allow you to use your voice to issue orders to computer-controlled characters, too.