Visual recognition has witnessed significant improvements thanks to recent advances in deep visual representations. In its most popular form, recognition is performed on web data, such as images and videos uploaded by users to platforms like YouTube or Facebook. However, perception is inherently tied to action: active perception is vital for robotics, as robots perceive in order to act and act in order to perceive.

At RE•WORK's Deep Learning in Robotics Summit in San Francisco last month, Georgia Gkioxari of Facebook AI Research (FAIR) introduced the latest AI technology, entitled "Embodied Vision". The term is used in contrast to conventional Computer Vision: rather than analysing static web images, Embodied Vision refers to the perceptual and cognitive ability of an embodied agent, a robot that moves and acts in its environment. The emphasis is not only on the robot perceiving the objects around it, but also on understanding their meaning, much as a human being does.

Learning Method

FAIR trains robots from three perspectives:

  1. First, robots learn the meaning of words by looking at objects in a virtual environment. This is called "Language Grounding": the robot connects names to objects in its environment (for example, it can find a long green candle in the room).
  2. Second, the robot moves to a designated place in the house. This is called "Visual Navigation": the robot follows corridors, opens doors, and moves to the designated place (when instructed to go to the bedroom, the robot moves there).
  3. Third, when the robot receives a question, it moves through the house to find the answer. This is called "Embodied QA": the robot navigates the virtual environment in search of answers. Conventional question-answering systems find answers on the Internet; an Embodied QA agent instead moves through physical space to find them. For example, when asked "What is the colour of the car?", the robot understands the question and starts searching for a car. Common sense tells it that cars are usually parked in the garage, so it heads for the garage. It does not know where the garage is, but again uses common sense to guess that a garage is outdoors. It therefore leaves through the entrance, crosses the garden, and reaches the garage, where it discovers the car and determines that its colour is, for example, "orange". A minimal sketch of such an episode loop follows this list.
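
To make the structure of an Embodied QA episode concrete, here is a minimal sketch in Python. The environment and agent interfaces (`ToyEqaEnv`, `ScriptedAgent`, the action strings) are hypothetical stand-ins invented for illustration, not House3D's or FAIR's actual API; the sketch only shows the observe, navigate, answer loop described above.

```python
# Hedged sketch of an Embodied QA episode: the agent receives a question,
# issues navigation actions until it decides it has seen enough, then
# answers. All names here are invented stand-ins, not FAIR's API.

class ToyEqaEnv:
    """Illustrative stand-in for an embodied 3D environment."""
    def reset(self, question):
        self.question, self.t = question, 0
        return {"view": f"frame-{self.t}", "question": question}

    def step(self, action):
        self.t += 1
        done = (action == "answer")   # episode ends when the agent answers
        return {"view": f"frame-{self.t}", "question": self.question}, done


class ScriptedAgent:
    """Trivial agent: walks forward a few steps, then answers."""
    def act(self, obs, step):
        return "forward" if step < 5 else "answer"

    def answer(self, obs):
        return "orange"  # placeholder; a real agent would infer this


def run_episode(env, agent, question, max_steps=50):
    obs = env.reset(question)
    for step in range(max_steps):
        action = agent.act(obs, step)      # e.g. "forward", "turn-left"
        obs, done = env.step(action)
        if done:
            return agent.answer(obs)
    return None                            # gave up without answering


print(run_episode(ToyEqaEnv(), ScriptedAgent(), "What is the colour of the car?"))
```
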
Necessary Functions

To perform these tasks, the robot's brain requires a wide range of AI techniques: Perception, Language Understanding, Navigation, Commonsense Reasoning, and Grounding of words and actions. Gkioxari's research team succeeded in building an Embodied QA model and executing the task in the 3D virtual environment "House3D".

Robot's Brain

In this model, the robot's brain is composed of a Planner and a Controller, trained by deep reinforcement learning. The Planner is the commander: it decides the direction of travel (forward, backward, left, or right). The Controller is the executor: it determines how far to advance (the number of steps) according to the Planner's instruction. The Planner is a Long Short-Term Memory (LSTM) network and, as noted above, is trained by deep reinforcement learning, acquiring common sense through repeated trial and error much as humans do. A rough sketch of this Planner/Controller split appears below.
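
The sketch below, in PyTorch, illustrates the division of labour described above: an LSTM planner scores the four directions, and a small controller scores how many steps to take in the chosen direction. The layer sizes, action space, and module names are assumptions made for illustration; this is not FAIR's published model.

```python
import torch
import torch.nn as nn

DIRECTIONS = ["forward", "backward", "left", "right"]

class Planner(nn.Module):
    """LSTM commander: given the current visual feature, pick a direction."""
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, len(DIRECTIONS))

    def forward(self, feat, state):
        h, c = self.lstm(feat, state)
        return self.head(h), (h, c)      # scores over the four directions

class Controller(nn.Module):
    """Executor: given the feature and chosen direction, pick a step count."""
    def __init__(self, feat_dim=128, max_steps=5):
        super().__init__()
        self.head = nn.Linear(feat_dim + len(DIRECTIONS), max_steps)

    def forward(self, feat, direction_onehot):
        return self.head(torch.cat([feat, direction_onehot], dim=-1))

# One decision cycle with hypothetical shapes:
feat = torch.randn(1, 128)                        # image feature from a CNN
planner, controller = Planner(), Controller()
state = (torch.zeros(1, 64), torch.zeros(1, 64))  # initial LSTM state
dir_logits, state = planner(feat, state)
direction = dir_logits.argmax(dim=-1)
onehot = torch.nn.functional.one_hot(direction, len(DIRECTIONS)).float()
n_steps = controller(feat, onehot).argmax(dim=-1) + 1  # steps to advance
print(DIRECTIONS[direction.item()], int(n_steps))
```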

Development of Intelligent AI Stagnates

FAIR is developing intelligent robots through these studies. AI is evolving rapidly: image classification now exceeds human ability, and AI surprised the world by beating the human champion at Go. Yet despite these impressive abilities, AI is far from intelligent. It does not understand the meaning of objects (e.g. cats) and can perform only narrow tasks such as Go (AlphaGo, for example, cannot drive a car). Today's robots cannot even move around a house the way a human does. In other words, the development of AI that can think intelligently like a human has stagnated, with no breakthrough in sight.

Elaborate Virtual Environment

For this reason, FAIR is developing AI with an entirely different approach: training AI agents in a 3D virtual environment that simulates the real world, with the aim of having them learn complicated tasks by themselves. Through learning in this virtual world, FAIR hopes to develop algorithms capable of human-like vision, natural conversation, planning, and intelligent thought. This requires a virtual environment that looks like the real world, so FAIR is building a 3D environment rendered as faithfully as if the inside of a house had been photographed. OpenAI and Google DeepMind are taking the same approach, and the competition to develop deep reinforcement learning in elaborate virtual environments is intensifying. A minimal illustration of such trial-and-error training follows.
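
To show what learning by trial and error in a simulated environment means in practice, here is a minimal REINFORCE-style policy-gradient loop on a toy corridor task. The environment, reward, and network are invented for this example; real systems such as House3D use far richer 3D observations, but the training principle, act, collect a reward, and reinforce the actions that led to success, is the same.

```python
import torch
import torch.nn as nn

# Toy corridor: the agent starts at cell 0 and is rewarded for reaching
# cell 4. Actions: 0 = left, 1 = right. Everything here is invented for
# illustration; real work uses rich 3D environments such as House3D.

policy = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(policy.parameters(), lr=0.01)

def one_hot(pos):
    x = torch.zeros(5)
    x[pos] = 1.0
    return x

for episode in range(300):
    pos, log_probs = 0, []
    for _ in range(10):                   # trial: act in the environment
        dist = torch.distributions.Categorical(logits=policy(one_hot(pos)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        pos = max(0, min(4, pos + (1 if action.item() == 1 else -1)))
        if pos == 4:
            break
    reward = 1.0 if pos == 4 else 0.0     # error signal: did we succeed?
    loss = -reward * torch.stack(log_probs).sum()  # REINFORCE update
    opt.zero_grad()
    loss.backward()
    opt.step()
```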

Facebook Develops Robot

Making robot brains intelligent would fundamentally change human life. Facebook developed the virtual assistant "M", but has since withdrawn it as a product. M was designed, like a hotel concierge, to answer any question, but the range of conversation topics with humans was too wide for the AI to handle. Embodied Vision is an important foundational technology for virtual assistants and AI speakers. Furthermore, if this research goes well, a roadmap for home-robot development will come into view. The market is watching to see whether Facebook will develop intelligent home robots.

Watch the video of Georgia's presentation here.