Bots, bots, bots! We’ve heard about them, we’ve seen them, and we’ve likely used them, maybe without even knowing it. In May, Facebook announced that there were over 300,000 bots on Facebook Messenger, and of course, Microsoft CEO Satya Nadella famously announced in 2016 that “bots are the new apps.”

In this post, I am not going to talk about chatbots, partly because there’s not a whole lot new to say, but mostly because at CloudMinds, we’re working on what’s next. Our goal is to have a robot (an actual physical one!) in everyone’s home by the year 2025. That’s not that far away! And one of the first things you’ll want to do with your home robot is to have a conversation with it (OK, “robot, do the dishes” isn’t really a conversation per se, but you get the idea; you’ll talk to it)!

So let’s get into how you’ll have a conversation with your robot.

AI and Conversation

How does this work, conversing with a machine? Let’s start with Figure 1 below. You’ve got 5 major parts:

  • Speech To Text (STT) or automatic speech recognition (ASR) takes the sound of what you say and converts it to text.
  • Natural Language Understanding (NLU) takes that string of words and finds out what the speaker means. What do you want the robot to do? What question do you have?
  • Dialog Management (DM) takes the meaning of that particular sentence, along with many things like the history of the conversation, the preferences of the user, and application data (like when a store is open), to figure out what to say or do next.
  • Natural Language Generation (NLG) converts the output meaning from DM into a string of words to say. Frequently DM and NLG are done together.
  • Finally, Text to Speech (TTS) takes the output string of text and produces speech! TTS “engines” can produce speech with a variety of “voices” (male, female, and child) and other tweaks like pitch and rate.
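To make the flow between the five parts above concrete, here is a minimal sketch in Python. Everything in it is a stand-in for illustration: the function names, the keyword-matching NLU, the template-based NLG, and the made-up store hours are all assumptions, and real STT and TTS engines are replaced by trivial stubs. It only shows how each stage hands its output to the next.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Output of NLU: what the speaker wants, plus any extracted details."""
    name: str
    slots: dict = field(default_factory=dict)


@dataclass
class DialogState:
    """What DM keeps between turns: conversation history and user preferences."""
    history: list = field(default_factory=list)
    preferences: dict = field(default_factory=dict)


# Hypothetical application data that DM can consult (made up for this sketch).
STORE_HOURS = {"store": "9 am to 5 pm"}


def speech_to_text(audio: bytes) -> str:
    # Stand-in for a real STT engine: the "audio" here is just encoded text.
    return audio.decode("utf-8")


def understand(text: str) -> Intent:
    # Toy NLU: simple keyword matching instead of a trained model.
    words = text.lower().replace("?", "").split()
    if "open" in words:
        return Intent("ask_hours", {"place": "store"})
    if "dishes" in words:
        return Intent("do_chore", {"chore": "dishes"})
    return Intent("unknown")


def manage_dialog(intent: Intent, state: DialogState) -> dict:
    # DM combines the meaning of this sentence with history and application data.
    state.history.append(intent.name)
    if intent.name == "ask_hours":
        place = intent.slots["place"]
        return {"act": "inform_hours", "place": place, "hours": STORE_HOURS[place]}
    if intent.name == "do_chore":
        return {"act": "confirm_chore", "chore": intent.slots["chore"]}
    return {"act": "ask_repeat"}


def generate_text(action: dict) -> str:
    # Toy NLG: fill a template chosen by the dialog act.
    templates = {
        "inform_hours": "The {place} is open from {hours}.",
        "confirm_chore": "OK, I will do the {chore}.",
        "ask_repeat": "Sorry, could you say that again?",
    }
    return templates[action["act"]].format(**action)


def text_to_speech(text: str, voice: str = "female",
                   pitch: float = 1.0, rate: float = 1.0) -> dict:
    # Stand-in for a real TTS engine: describe the speech instead of producing audio.
    return {"text": text, "voice": voice, "pitch": pitch, "rate": rate}


def converse(audio: bytes, state: DialogState) -> dict:
    # One turn of conversation: STT -> NLU -> DM -> NLG -> TTS.
    text = speech_to_text(audio)
    intent = understand(text)
    action = manage_dialog(intent, state)
    reply = generate_text(action)
    return text_to_speech(reply)


if __name__ == "__main__":
    state = DialogState()
    print(converse(b"When is the store open?", state)["text"])
```

Note that the easy-looking part of this sketch, `manage_dialog`, is exactly the part that is hand-built here, with one rule per intent.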

Now, I should mention that, as Figure 1 shows, as a general rule the further to the left of the diagram you are, the easier the problem is, and the further to the right (e.g., Dialog Management), the harder it is. Don’t get me wrong; STT, for example, is a very hard problem, and some very smart people have been working on it for upwards of 50 years. But we do have very good STT; witness systems like Siri or Alexa. We don’t have very good non-hand-built systems for DM, though.

Figure 1. System for having a conversation with a robot.

It’s Engagement, stupid!

So now we know what the parts are that allow you to talk to a robot. Why do people want to talk to some robots, while others are dreaded (think phone trees, e.g. “press 1 for this, 2 for that…”)?

Figure 2. People having fun with Pepper. This is engagement.

Take a look at Figure 2. It’s pretty clear that these folks are actually having a good time with Pepper, a humanoid robot built by SoftBank Robotics. Why is that? What makes Pepper different from the phone trees from hell?

It’s sometimes hard to quantify exactly, but we call this engagement. The user is engaged. It’s not that much different from when you’re talking to someone else; sometimes you spend hours and hours and the time flies by, sometimes you can’t wait for it to end.

So what makes it engaging? Much of it has to do with being natural; the robot has to respond as you might expect an engaging human to do, not, well, robotically! And not only natural in terms of the text of the response, but also the voice, and the gestures, and facial movements, and everything!

And this non-robotic naturalness is really hard to do, and we haven’t cracked it yet. We’re really good at STT, and, especially over the last few years, very good at NLU, but we’re not yet very good at building DM that provides an engaging, natural dialog. There’s a yearly competition called the Loebner Prize that looks for the most human-sounding chatbot; pretty much all of the winners of the prize win by building systems with tens of thousands of hand-crafted rules. AI researchers are working on doing that automatically, using all the tricks of deep learning, LSTMs, and all that, but we’re not there yet.

Human-powered AI

Enter humans! Sounds funny, huh, using humans to make robots more natural. This is what we do at CloudMinds. We have a cloud platform for operating robots like Pepper and many others; our platform not only includes AI for the parts you saw above, but also human operators for backup, when the AI is not good enough yet. We call this HARI [1], for Human Augmented Robotic Intelligence, and it’s shown in Figure 3.
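One simple way to picture this kind of human backup is a confidence threshold: the AI answers when it is sure enough, and otherwise the question goes to a person. The sketch below is only an illustration of that general idea, not how HARI is actually implemented; the names `ai_answer`, `respond`, and `ask_human`, the confidence scores, and the 0.7 threshold are all assumptions made up for the example.

```python
from typing import Callable, Tuple

# Assumed cutoff below which the AI's answer is not trusted (illustrative value).
CONFIDENCE_THRESHOLD = 0.7


def ai_answer(utterance: str) -> Tuple[str, float]:
    # Toy AI: knows one greeting; returns (reply, confidence).
    known = {"hello": ("Hi there! Nice to see you.", 0.95)}
    key = utterance.lower().strip("!?. ")
    return known.get(key, ("", 0.0))


def respond(utterance: str,
            ask_human: Callable[[str], str],
            threshold: float = CONFIDENCE_THRESHOLD) -> Tuple[str, str]:
    # Use the AI's reply when it is confident enough, else hand off to a human.
    text, confidence = ai_answer(utterance)
    if confidence >= threshold:
        return text, "ai"
    return ask_human(utterance), "human"


if __name__ == "__main__":
    # Hypothetical operator: in a real system this would be a person, not a function.
    operator = lambda utterance: "Let me think about that one."
    print(respond("Hello!", operator))
    print(respond("What is the meaning of life?", operator))
```

From the user's side, both replies arrive the same way; only the second element of the returned pair records whether the AI or the human operator produced it.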

Figure 3. CloudMinds' HARI, or Human Augmented Robotic Intelligence.

CloudMinds uses HARI to provide the answer when the AI can’t (at least yet), so that we get the natural, engaging interactions that we saw with Pepper.

So there you have it! Humans and AI working together to bring us robots that we’ll actually want to have conversations with!

Co-authored with Charles R. Jankowski Jr., Ph.D. Charles is Director of AI and Robotics Applications at CloudMinds. He is co-author of “Voice-First Development,” soon to be released from Manning Publications, and a contributor to “Ubiquitous Voice: Essays from the Field.”

[1] For more information on our HARI platform, please see

CloudMinds will be exhibiting at the Applied AI Summit in Houston this November. If you're interested in hearing more about their exciting work, don't forget to sign up here.