Neural networks trained on datasets such as ImageNet have recently led to major advances in visual object classification. The main obstacle preventing networks from reasoning more deeply about scenes and situations, and from integrating visual information with natural language the way humans do, is their lack of common-sense knowledge about the physical world. Unlike still images, fine-grained prediction tasks in videos can reveal such physical information, because videos implicitly encode properties such as 3-D geometry, materials, "objectness", and affordances.

At the Deep Learning Summit, Roland Memisevic, Chief Scientist at Twenty Billion Neurons, will describe a new video dataset his team has created, showing objects engaged in complex motions and interactions. He will also show how neural networks can learn from this data to make fine-grained predictions about actions and situations.

We caught up with Roland to hear his reasons for co-founding Twenty Billion Neurons, to learn about the impactful applications of his research, and to discuss the key ethical standards the company follows.

Can you tell us why you founded Twenty Billion Neurons?

My personal reason for co-founding Twenty Billion Neurons (TwentyBN) is that this company allows me to participate in an ambitious technical vision that I could not pursue elsewhere, neither in academia nor in a large research lab at a big company. TwentyBN's mission is to allow neural networks to perceive and understand the world. This means enabling networks not just to output, say, "cat" or "dog" in response to seeing an image, but to fully comprehend a scene or situation and anticipate what will happen next.

For neural networks to gain more common sense, they need to learn about the three-dimensional world: how objects behave in it, how they relate to one another, and how they respond to actions. We believe that the only way a machine can learn about the real world is by observing how objects behave and respond to actions in the real world, much like humans do as they grow up. Only real-world videos reveal that pixels tend to move not randomly but in certain patterns, and how those patterns relate to objects, materials, depth, motion, inertia, gravity, and so on. Unfortunately, unlike for still images, where there are datasets like ImageNet, there is no comprehensive video dataset whose annotations would allow networks to discover such fine-grained information about the world. Furthermore, training neural networks on video is technically much harder than training on image data, because the data volume is orders of magnitude larger.

At TwentyBN we decided to make videos "first-class citizens" in the deep learning world and to see how far we can push the envelope of visual understanding using transfer learning from video features, akin to how ImageNet features have been used for still images. To this end we have gathered a team of some of the best scalability engineers, researchers and vision experts in Germany and Canada, and created a video crowdsourcing platform that allows us to generate an "infinite" annotated dataset, always showing exactly the kinds of visual concepts our models currently struggle with the most. The only place where you can possibly pull off an operation of that scale and complexity is a startup. This, in a nutshell, is why Twenty Billion Neurons exists.
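To make the transfer-learning analogy concrete, here is a minimal, illustrative sketch of the general recipe, not TwentyBN's actual code: freeze a video network pretrained on a large annotated video dataset and train only a small classifier head on its features, much as ImageNet-pretrained features have been reused for still-image tasks. The encoder below is a hypothetical stand-in for such a pretrained network.

```python
# Illustrative sketch only: a frozen, "pretrained" video feature extractor
# with a small trainable head. The encoder is a hypothetical placeholder,
# not TwentyBN's model.
import torch
import torch.nn as nn

class PretrainedVideoEncoder(nn.Module):
    """Stand-in for a network pretrained on a large annotated video dataset."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # A real encoder would be a deep 3-D CNN or similar; this is minimal.
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, feat_dim),
        )

    def forward(self, clips):              # clips: (batch, 3, frames, H, W)
        return self.net(clips)

encoder = PretrainedVideoEncoder()
for p in encoder.parameters():             # freeze the pretrained features
    p.requires_grad = False

head = nn.Linear(512, 10)                  # small task-specific classifier
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

clips = torch.randn(4, 3, 16, 112, 112)    # a tiny batch of video clips
labels = torch.randint(0, 10, (4,))

features = encoder(clips)                  # reused spatio-temporal features
loss = loss_fn(head(features), labels)     # train only the head
loss.backward()
optimizer.step()
```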

What are some of the impactful applications of the research?

Our long-term vision amounts to teaching networks to perceive and understand the world. Since we believe that the only path towards this ambitious goal runs through video, we are tackling two video use cases as our first products. First, using our video dataset, we built a one-shot video classifier (akin to how people have used ImageNet to build one-shot image classifiers), with which we address safety and surveillance applications. Because our networks have already learned sophisticated spatio-temporal features from the large annotated database of videos we have recorded so far, we can offer this using far fewer training examples than would otherwise be required. We also offer recording of task-specific videos through our platform. A second application we recently started to explore is dashcam video analysis that supports detection of objects and events. We plan to offer this as a feature representation that can complement (for example, through ensembling) the primarily image-based solutions currently on the market.
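As a rough illustration of what one-shot classification on top of pretrained video features can look like (a sketch under assumed inputs, not TwentyBN's product): given a single labelled example embedding per class, a new clip can be classified by nearest-neighbour matching in the embedding space. The class names and embeddings below are hypothetical placeholders.

```python
# Illustrative sketch only: one-shot classification by nearest-neighbour
# matching in a pretrained embedding space. The embeddings stand in for
# spatio-temporal features from a pretrained video network; the values
# and class names are random, hypothetical placeholders.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
feat_dim = 512

# One labelled example ("shot") per class, already embedded.
support = {
    "person_enters_area": rng.normal(size=feat_dim),
    "object_left_behind": rng.normal(size=feat_dim),
    "door_opened": rng.normal(size=feat_dim),
}

# Embedding of a new, unlabelled video clip.
query = rng.normal(size=feat_dim)

# Predict the class whose single support embedding is most similar.
prediction = max(support, key=lambda c: cosine_similarity(query, support[c]))
print(prediction)
```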

What are the key ethical standards you follow at Twenty Billion Neurons?

As an AI company whose goal is to significantly advance visual perception and "common sense" reasoning in machines, we are well aware of the worries and concerns that naturally accompany any significant new technological advance. While we consider popular "Skynet"-like scenarios to be unrealistic fear-mongering, AI does bring challenges to society, and it is our responsibility to be aware of and to address those challenges. They include, in particular, the need for transparency, which we address by pursuing our research in open discourse with the deep learning community, and the need to keep the technology beneficial to humans, which we address by carefully selecting its applications.

Roland Memisevic will be presenting at the Deep Learning Summit alongside other great speakers, including Andrew Tulloch, Research Engineer at Facebook; Danny Lange, VP of AI and Machine Learning at Unity Technologies; and Bryan Catanzaro, VP of Applied Deep Learning Research at NVIDIA. View the agenda here. You can also follow all announcements during the event as they happen on our Storify page.

The summit will run alongside the Virtual Assistant Summit, where you can hear from the likes of Slack, Google, Pixar Animation Studios, University of Washington and x.ai. View the agenda here.

View the full events calendar here.