Attention Mechanisms: How Can Knowing Where to Look Improve Visual Reasoning?

The human visual cortex uses attention to discard irrelevant information and to allocate computational resources efficiently. This has inspired modern machine learning, where attention mechanisms are a vital part of memory modules and are used for modelling object interactions as well as for solving complex reasoning tasks. At the Deep Learning Summit in London this September 20-21, Adam Kosiorek, PhD candidate at the University of Oxford and Research Intern at DeepMind, will explore attention mechanisms for visual tasks and show how, used in a recurrent framework with a hierarchy of attention mechanisms, they can help to track objects in real-world videos. Attention can also model common assumptions we make about objects: that they do not appear out of nowhere and do not disappear into thin air. This insight lends itself to detecting and tracking multiple objects without any human supervision.
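To give a flavour of the mechanism ahead of the talk, here is a minimal sketch of DRAW-style soft spatial attention, one common way of implementing "where to look" in a neural network (the function name, sizes, and parameter values below are illustrative choices, not code from the talk):

```python
import numpy as np

def gaussian_glimpse(image, center, delta, sigma, n=12):
    """Extract an n-by-n glimpse from a 2-D image using a grid of
    1-D Gaussian filters along each axis (soft spatial attention)."""
    H, W = image.shape
    # Filter centres: an n-point grid spaced `delta` apart around `center`.
    offsets = (np.arange(n) - n / 2 + 0.5) * delta
    mu_y = center[0] + offsets                                     # (n,)
    mu_x = center[1] + offsets
    ys = np.arange(H)[None, :]                                     # (1, H)
    xs = np.arange(W)[None, :]
    Fy = np.exp(-((ys - mu_y[:, None]) ** 2) / (2 * sigma ** 2))   # (n, H)
    Fx = np.exp(-((xs - mu_x[:, None]) ** 2) / (2 * sigma ** 2))   # (n, W)
    Fy /= Fy.sum(axis=1, keepdims=True) + 1e-8                     # normalise filters
    Fx /= Fx.sum(axis=1, keepdims=True) + 1e-8
    return Fy @ image @ Fx.T                                       # (n, n) glimpse

image = np.random.default_rng(0).random((64, 64))
glimpse = gaussian_glimpse(image, center=(32.0, 32.0), delta=2.0, sigma=1.5)
print(glimpse.shape)   # (12, 12)
```

Because the glimpse is a differentiable function of the centre, spacing, and width of the filters, a network can learn where to look by gradient descent.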

In advance of the summit, we caught up with Adam to hear a bit more about his career to date, as well as his current work.

Give me an overview of your work at the University of Oxford & DeepMind

I believe in probabilistic reasoning - in maintaining beliefs about what is happening around us. It is important for these beliefs to capture not only the state of the surrounding world, but also our uncertainty about it. For this reason, my research focuses on stochastic neural networks and the algorithms used to train them, mostly for generative modelling of time series.

How did you begin your work in Deep Learning, and what came first for you, robotics or artificial intelligence?

My fascination with robotics started when I re-watched the movie “I, Robot” around 2008. I was captivated by the idea of creating a personal assistant for everyone: it could improve the quality of life for the majority of people on the planet. I didn’t realise back then how difficult this problem is.

It took me several years to find out that what actually fascinates me are learning and intelligence. What kind of control algorithms do we need to make humanoid robots move the way we do? How do we endow artificial agents with the reasoning capabilities that we have? Trying to answer questions like these is what motivated me to get involved in deep learning. Nowadays, I am interested mostly in the reasoning part. What does it mean to think, and how can we build machine learning models that do that?

What motivates you to keep working in this space?

I find deep learning research extremely rewarding. There is something special about creating systems that can learn to solve difficult problems from scratch. They never work at the beginning, but as we invest time and effort, we gradually develop an understanding of how they work and improve them.

How are you working to improve visual reasoning?

The majority of research in deep learning for computer vision is about supervised learning: we specify the problem as pairs of inputs and outputs and train machine learning models to map the inputs to the outputs. This approach has been highly successful, but it is extremely expensive and, generally speaking, out of reach unless you are a big company.

I’m focusing on structured models that can learn useful representations without any human supervision. This way, we can leverage vast amounts of unlabelled data to learn representations, which can then be used with only a minimal amount of supervision to solve problems. On one hand, this makes learning more human-like; on the other, it makes machine learning easier to use for those who do not have unlimited resources.
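As a toy illustration of this two-stage recipe (the models below are deliberately simple stand-ins, not Adam's actual methods): first fit a representation on plentiful unlabelled data, then train a small predictor on a handful of labels on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with low-dimensional structure: observations are noisy linear
# mixtures of a few latent factors (all names and sizes are illustrative).
A = rng.normal(size=(5, 50))                             # mixing matrix
Z_unlab = rng.normal(size=(10_000, 5))                   # plentiful unlabelled data
X_unlab = Z_unlab @ A + 0.1 * rng.normal(size=(10_000, 50))

# Stage 1 (unsupervised): learn a representation from unlabelled data alone.
# A PCA-style linear autoencoder stands in for a learned deep encoder.
mean = X_unlab.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlab - mean, full_matrices=False)
encode = lambda X: (X - mean) @ Vt[:5].T                 # 5-D representation

# Stage 2 (minimal supervision): a tiny labelled set suffices on top of the
# learned features; a least-squares classifier stands in for a small head.
Z_lab = rng.normal(size=(100, 5))
X_lab = Z_lab @ A + 0.1 * rng.normal(size=(100, 50))
y = (Z_lab[:, 0] > 0).astype(float)                      # labels depend on a latent
w, *_ = np.linalg.lstsq(encode(X_lab), y, rcond=None)
accuracy = ((encode(X_lab) @ w > 0.5) == y).mean()
print(f"train accuracy with 100 labels: {accuracy:.2f}")
```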

What challenges are you currently facing in your work, and how are you using AI to overcome these?

The problem I am currently working on concerns space, or rather how we make artificial agents aware of the (physical) space around them. It is challenging, not least because it is not entirely clear what the desired outcome is: we are not sure what being aware of space means.

What are some of the real world applications of your work?

Some of the learning algorithms I have helped to develop can be used for training stochastic autoencoder-type neural networks, which can be applied almost anywhere.
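For readers unfamiliar with the term, "stochastic autoencoder" refers to models such as the variational autoencoder, whose latent code is sampled rather than computed deterministically. Below is a minimal sketch of the reparameterisation trick that makes such models trainable by backpropagation (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_mu, W_logvar):
    """Map an input to the mean and log-variance of the Gaussian q(z|x)."""
    return W_mu @ x, W_logvar @ x

def sample_latent(mu, logvar):
    """Reparameterisation: the noise is external to the parameters,
    so gradients can flow through mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, 1) ), the regulariser in the VAE objective."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

x = rng.standard_normal(8)                       # toy 8-D input
W_mu, W_logvar = 0.1 * rng.normal(size=(2, 4, 8))  # two random 4x8 encoder weights
mu, logvar = encoder(x, W_mu, W_logvar)
z = sample_latent(mu, logvar)                    # stochastic 4-D latent code
print(z, kl_to_standard_normal(mu, logvar))
```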

I have also developed two novel algorithms for object tracking, which I will say more about in my talk. The first, HART [1], can be used for tracking objects in real-world videos, although it requires human-labelled data for training. The second, SQAIR [2], models moving objects without any human supervision and lends itself to building systems that can detect and track objects in videos.
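HART itself combines learned spatial attention with a recurrent network; the skeleton below shows only the generic attend-update-predict loop that such trackers share, with untrained random weights and a hard crop standing in for learned attention (a hypothetical simplification, not the published model):

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_glimpse(frame, box, size=32):
    """Hard attention stand-in: crop box = (y, x, h, w), subsample to size x size."""
    y, x, h, w = box
    patch = frame[y:y + h, x:x + w]
    ys = np.linspace(0, patch.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, size).astype(int)
    return patch[np.ix_(ys, xs)]

# Random, untrained weights - this shows the data flow only.
state_dim, glimpse_dim = 64, 32 * 32
W_in = 0.01 * rng.standard_normal((state_dim, glimpse_dim))
W_rec = 0.01 * rng.standard_normal((state_dim, state_dim))
W_out = 0.01 * rng.standard_normal((4, state_dim))        # box-correction head

def track(frames, init_box):
    """Per frame: attend near the last box, update the state, correct the box."""
    state = np.zeros(state_dim)
    box = np.array(init_box, dtype=float)
    boxes = [box.copy()]
    for frame in frames:
        glimpse = crop_glimpse(frame, box.astype(int)).ravel()
        state = np.tanh(W_in @ glimpse + W_rec @ state)   # recurrent update
        box += W_out @ state                              # predicted correction
        boxes.append(box.copy())
    return boxes

frames = rng.random((10, 128, 128))                       # toy grayscale video
print(track(frames, (40, 40, 32, 32))[-1])
```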

AI is being applied in countless industries - what areas are you most excited to see transformed, and where do you think we’ll see the biggest impact?

I am most excited about transforming the sciences. Machine learning models can be used to enhance and speed up research pipelines in many fields: they allow us to accelerate the simulation of stars or lower the cost of discovering new drugs, often by orders of magnitude. We can even use ML to make our organisations more efficient, for example by helping us schedule and run meetings.

What’s next for you?

I am trying to endow artificial agents with a sense of space and a notion of the objects surrounding them. This is part of my PhD, which I have a little more than a year left to complete. After I graduate, I am not yet sure whether I will start a company or stay in industrial research.

[1] Kosiorek, A. R., Bewley, A. and Posner, I., 2017. Hierarchical Attentive Recurrent Tracking. In Advances in Neural Information Processing Systems (pp. 3053-3061).

[2] Kosiorek, A. R., Kim, H., Posner, I. and Teh, Y. W., 2018. Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects. arXiv preprint arXiv:1806.01794.