Thanks to the inroads made by deep learning, computer vision appears to be on the verge of being solved. Current methods, however, are extremely data hungry, and obtaining high-quality labelled data is both expensive and cumbersome. Instead of letting humans do the hard work, can we turn our computers into couch potatoes and program them to figure out our visual world by watching decades of video? The team at Google has set out to push the frontiers of computer vision by giving an affirmative answer to this question. Christian Szegedy is a Senior Research Scientist at Google, working on deep learning for computer vision, including image recognition, object detection and video analysis. We caught up with Christian ahead of his presentation at the Deep Learning Summit in Boston this month.

What are the main types of problems now being addressed in the deep learning space?

Deep learning is being applied successfully to machine perception and large-scale data analysis. Prime examples of the former are all kinds of computer vision, speech recognition and music classification tasks. A major portion of the computer vision literature of the last two years is dedicated to applying learned deep convolutional network features to a wide variety of vision problems, with great success. Recurrent neural networks have also begun to revolutionize machine translation and text understanding.

What are the practical applications of your work, and what sectors are most likely to be affected?

My recent work focuses on fundamental computer vision tasks: image annotation, object detection, segmentation and pose estimation. This has laid the groundwork for many of the computer vision systems used in Google products. For example, Inception network architectures are at the core of several vision-heavy Google services: personal photo search by image content, face tagging in social photos, and business detection and recognition in Street View imagery. Advances in deep learning pave the way for a future in which computers will use visual signals as easily, efficiently and ubiquitously as they process text today.

What developments can we expect to see in deep learning in the next 5 years?

Current deep learning algorithms and neural networks are far from their theoretically possible performance. Today we can design vision networks that are 5-10 times cheaper and use 15 times fewer parameters than their much more expensive counterparts from a year ago, yet still outperform them, solely by virtue of improved network architectures and better training methodologies. I am convinced that this is just the start: deep learning algorithms will become so efficient that they will run on cheap mobile devices, even without extra hardware support or prohibitive memory overhead.

What advancements excite you most in the field?

The inroads of machine learning will transform all of information technology. Most prominently, the way we program our computers will slowly shift from prescribing how to solve problems to merely specifying them and letting machines learn to cope with them. We could even have machines distill their solutions into formal procedures akin to our current programs. To truly get there, the most exciting developments will come from the synergy of currently disjoint areas: the marriage of formal, discrete methods with fuzzy, probabilistic approaches such as deep neural networks.
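For readers curious what the architectural savings Christian describes look like in practice, here is a minimal, illustrative sketch of an Inception-style block in PyTorch. This is not Google's production code; the branch widths are borrowed from the inception(3a) block of the GoogLeNet paper. The key idea is that cheap 1x1 "bottleneck" convolutions shrink the channel depth before the expensive 3x3 and 5x5 filters, which is one reason such networks get by with far fewer parameters.

```python
# Minimal Inception-style block (illustrative sketch, not Google's code).
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pool branches, concatenated channel-wise."""

    def __init__(self, in_ch, out_1x1, red_3x3, out_3x3, red_5x5, out_5x5, out_pool):
        super().__init__()
        # Branch 1: plain 1x1 convolution.
        self.b1 = nn.Conv2d(in_ch, out_1x1, kernel_size=1)
        # Branch 2: 1x1 bottleneck reduces channels before the 3x3 filter.
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, red_3x3, kernel_size=1),
            nn.Conv2d(red_3x3, out_3x3, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 bottleneck before the even more expensive 5x5 filter.
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, red_5x5, kernel_size=1),
            nn.Conv2d(red_5x5, out_5x5, kernel_size=5, padding=2),
        )
        # Branch 4: max pooling followed by a 1x1 projection.
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, out_pool, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Branch widths from the inception(3a) block: 192 channels in, 256 out.
block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)
print(sum(p.numel() for p in block.parameters()))
# Roughly 160k parameters, versus over 1.2M for a single dense
# 5x5 convolution mapping 192 channels to 256.
```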
The Deep Learning Summit is taking place in Boston on 26-27 May. For more information and to register, please visit the event website here.
Join the conversation with the event hashtag #reworkDL