Are voice assistants social robots?

No, they're not. At least, not the way I define a robot. In my definition, a robot must be physical and be able to act on its physical environment. Some people call any agent a robot, even if it's purely software-based and only acts on data, but I think that's an unhelpfully broad definition. Hey, agentic AI is all the rage now, so can we just call such digital-only systems agents instead of robots?

That said, when it comes to voice assistant agents and robots that are designed to interact with people using natural language, I think there are more similarities than differences. Taking the strong view that natural language interaction includes non-verbal signalling from the body, face, and voice, both voice assistants and social robots share a lot of the sensing, dialogue management and even action planning functionality.

The cocktail party revisited

Let’s start with one of the most basic but interestingly difficult issues that both need to solve: the cocktail party problem. Imagine a virtual assistant or robot in a hospital. It is approached by a family, and one of the family addresses it. It knows this because it can hear a voice through its microphone. But who is talking? And who are they addressing? In fact, how many people are there? Without being able to sense who’s speaking, the agent can’t know. The family move on, having received the information they needed. On his way out, the father of the family thanks the robot, saying that the information provided was really helpful. Now hang on, thinks the robot, is this a new person? What interaction did I just get praise for?

The cocktail party problem is essentially about identifying who is talking to whom. It can be broken down into identifying how many people are in front of the agent, giving each an ID, and then using face re-identification and voice activity detection algorithms to accurately detect who is speaking. The next step is determining who they’re addressing, which will require a combination of linguistic analysis (which may include references to people) and gaze tracking.
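To make the who-is-speaking step concrete, here is a minimal sketch of fusing per-person visual and audio evidence to attribute an utterance. All names here (`PersonTrack`, `lip_motion`, `vad_overlap`, the weights and threshold) are hypothetical illustrations; in a real system these scores would come from face tracking and voice activity detection models.

```python
# Hypothetical sketch: attribute an utterance to one tracked person by
# fusing a lip-motion score (from the face tracker) with the overlap
# between that face's visibility and voiced audio (from the VAD).
from dataclasses import dataclass

@dataclass
class PersonTrack:
    person_id: str
    lip_motion: float   # 0..1, how much this person's mouth moved during the utterance
    vad_overlap: float  # 0..1, overlap of this face's presence with voiced audio

def attribute_speaker(tracks, w_lip=0.6, w_vad=0.4, threshold=0.5):
    """Assign the utterance to the person with the highest fused score,
    or to nobody if every score falls below the threshold."""
    fused = lambda t: w_lip * t.lip_motion + w_vad * t.vad_overlap
    best = max(tracks, key=fused)
    return best.person_id if fused(best) >= threshold else None

family = [
    PersonTrack("mother", lip_motion=0.9, vad_overlap=0.8),
    PersonTrack("father", lip_motion=0.2, vad_overlap=0.8),
    PersonTrack("child",  lip_motion=0.1, vad_overlap=0.3),
]
print(attribute_speaker(family))  # mother
```

The threshold matters: it lets the agent say "someone off-camera is speaking" rather than forcing an attribution, which is exactly the father-walking-past failure case above.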

This problem is pretty much the same for social robots and disembodied virtual assistants. They both need to solve it. One big difference is that robots can move their head around to investigate more of the environment and check whether all members of a party are in view. Robots also often come with an array of microphones that helps them localise the source of a voice.

Virtual assistants frequently have only a single static camera and microphone, which limits what they can do. Perhaps we’ll see a generation of virtual assistant hardware that comes equipped with a pan-tilt-zoom camera and a microphone array to help it solve the cocktail party problem; it would still be significantly cheaper than a mobile robot.
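Why does a microphone array help localise a voice? Because sound arrives at each microphone at a slightly different time, and that delay encodes direction. Here is a toy sketch with a two-microphone pair where the delay is simply given; real systems estimate it from the audio with methods like GCC-PHAT across many pairs.

```python
# Toy direction-of-arrival estimate from a two-microphone pair.
# The inter-mic delay is assumed known here; estimating it from audio
# (e.g. via cross-correlation) is the hard part in practice.
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def doa_from_delay(delay_s, mic_spacing_m):
    """Angle of the sound source relative to broadside of the mic pair.
    0 degrees = straight ahead; positive = towards the later mic."""
    ratio = delay_s * SPEED_OF_SOUND / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))

# Mics 10 cm apart, one hears the voice 0.1 ms after the other:
print(round(doa_from_delay(1e-4, 0.10), 1))  # roughly 20 degrees off-centre
```

A single static microphone gives you none of this, which is exactly the gap the hypothetical upgraded assistant hardware would close.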

Non-verbal perception matters

I don’t have to tell you that very few people are rational actors. Not even when it comes to doing business. Emotion, for better or worse, affects how we make decisions, how we process information, how we learn, and how we build relationships.

So, when a person interacts with a virtual assistant or social robot, it is crucial for this agent to form an idea of the person's emotional state. Call it an emotional theory of mind. Linguistics - what a person says - is one source of information for this. But what you say is easily controlled. How you say it is far less easy to control. You can’t control your increased heart rate, your sweating, your anxious fidgeting. At least not easily.

If your virtual assistant can interpret facial expressions, tone of voice, and body pose, as well as what is said, you can infer whether a person is confused, frustrated, or engaged with the information it’s providing, and plan the dialogue accordingly. You can create an empathetic agent that mirrors a person’s emotions and thereby builds a stronger bond.

Smile!

Which brings us to the actions that a virtual assistant or social robot can plan to take. On the face of it, they’re very different, because the virtual assistant is confined on (in?) a screen whereas the robot can act in the physical world. However, given that a lot of the social and emotional signalling is done by our face and voice, there is perhaps not that much difference. Leaving aside touch, which is an interesting modality but still quite some way from being used by robots, I think social robots and virtual assistants have just about the same capabilities for expressing emotion.

Cartwheel Robotics' vision to build a robot that expresses its emotion primarily through body language is a notable exception, but their robot Yogi is still in very early stages of development.

More fundamentally, when it comes to what to say, social robots and virtual assistants can share the exact same dialogue management systems.
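To illustrate that shared core, here is a minimal sketch of one dialogue manager driving two different embodiments, with only the final rendering step differing. All class names and the trivial policy are hypothetical, not a real dialogue framework.

```python
# Hypothetical sketch: one dialogue manager, two renderers.
# The policy decides WHAT to communicate; each embodiment decides HOW.
class DialogueManager:
    def next_act(self, user_state):
        # Trivial illustrative policy: reassure a confused user.
        if user_state == "confused":
            return {"intent": "clarify", "text": "Let me explain that again.",
                    "expression": "smile"}
        return {"intent": "inform", "text": "Anything else I can help with?",
                "expression": "neutral"}

class ScreenRenderer:
    """Virtual assistant: animate an on-screen avatar."""
    def render(self, act):
        return f"[avatar: {act['expression']}] {act['text']}"

class RobotRenderer:
    """Social robot: drive face actuators and a speaker."""
    def render(self, act):
        return f"[face motors: {act['expression']}] [speaker] {act['text']}"

dm = DialogueManager()
act = dm.next_act("confused")
print(ScreenRenderer().render(act))
print(RobotRenderer().render(act))
```

The point of the separation is that everything upstream of the renderer (perception, emotional theory of mind, dialogue policy) is embodiment-agnostic and can be reused across both platforms.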

Differences

There’s one difference to point out when it comes to attitudes towards these interactive agents. Ultimately, it is a social robot’s ability to act on its physical environment that can really change how people view a robot. The scariest thing a virtual assistant can do to a person is wink at them, perhaps. A human-sized robot can do actual physical harm. And let’s face it, some humanoids you see on social media these days don’t look friendly at all.

Assuming that you don’t want people to be afraid of your robot, that gives you another reason to measure people’s emotions during their interactions. They'd better feel like they’re in control. And of course, perhaps stop designing robots to look like something straight out of the Terminator franchise.

Quo vadis, virtual assistants?

So, virtual assistants are in many ways like social robots. Where will we find them instead of robots? Given the lower price point and the fact that they can simply be integrated into any screen, I think you will find them everywhere soon. Business lobbies, websites, ticket booths, you name it. One area that intrigues me in particular is their use in cars. A number of companies are developing sophisticated virtual assistants. BLUESKEYE AI itself is now working with Bosch Evoco to build an emotional voice coach for the car that does exactly what it should: build an empathetic relation with the occupant.

Are you building virtual assistants and want to make them more socially and emotionally aware? Reach out to me, I would love to help you.
