"The final missing piece in AI is Visual Cognition and Understanding. In order for this dream to be realized, it takes more than winning scores at classifying ImageNet."

At the Deep Learning Summit in San Francisco, 26-27 January, Brad Folkens, Co-Founder & CTO at CloudSight, will be discussing his experience in scaling, quality control, data management, and other important lessons learned in commercialising computer vision in the marketplace, including the procurement of the largest training dataset ever created for Visual Cognition and Understanding through Deep Learning. Brad Folkens is leading the effort to build the world's first cognition platform to power the future of AI.

I caught up with Brad ahead of the summit to hear his thoughts on the factors that have enabled recent advancements in image recognition, the main problems being addressed in the deep learning space, and the problems CloudSight solves. Plus, his predictions for computer vision over the next five years.

What started your work in deep learning and computer vision?

We started with this idea of mobile visual search in CamFind, and when we looked at the state-of-the-art technologies to drive our product, it became very clear that deep learning was going to win out in the long term. What we didn’t realize at first was how much demand we would get for the platform that drives CamFind (CloudSight), and that really led us down the rabbit hole with deep learning to build the first visual cognition and understanding engine.

What are the key factors that have enabled recent advancements in image recognition?

Recent advancements in hardware have sparked the re-ignition of neural networks as a tool for solving problems. Advancements in accuracy have been dramatic enough to shift the focus of the computer vision community towards deep learning to solve existing problems in new, creative ways, and that has led to a rapid acceleration in the field. For us, it’s been the data: being able to directly curate and control the metadata for the more than 400 million images we’ve now processed has given us a tremendous advantage in training neural networks.

What are the main types of problems now being addressed in the deep learning space?

Vision has been one of those problems that has been difficult to solve with traditional machine learning approaches. Specific problem domains can be addressed with certain hand-crafted algorithms, but in the general case, it’s been a failure. It wasn’t until deep learning re-emerged that we were able to solve vision-related challenges with remarkable accuracy, and that has paved the way for true visual cognition in AI. Autonomous driving and finance are two other fascinating cases where deep learning has the ability to solve long-standing problems. Take NVIDIA’s driving platform, for example: it’s a remarkably simple deep learning approach, but its ability to negotiate even the most difficult driving conditions is astounding.
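To make "remarkably simple" concrete, here is a minimal sketch of an end-to-end steering network in PyTorch, loosely inspired by NVIDIA's published PilotNet architecture; the layer sizes, input resolution, and training snippet are illustrative assumptions, not NVIDIA's production code:

```python
# Illustrative sketch of an end-to-end steering network, loosely inspired by
# NVIDIA's published PilotNet architecture. Layer sizes are assumptions for
# the sake of example, not the production model.
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional feature extractor over a 66x200 RGB camera frame
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        # Fully connected head regressing a single steering command
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),
        )

    def forward(self, x):
        return self.head(self.features(x))

# Training reduces to supervised regression against recorded human steering:
model = SteeringNet()
frames = torch.randn(8, 3, 66, 200)   # batch of camera frames
angles = torch.randn(8, 1)            # recorded steering angles
loss = nn.functional.mse_loss(model(frames), angles)
loss.backward()
```

The appeal of the approach is exactly this simplicity: a single network learns to map pixels to a control signal, with no hand-engineered lane detection or rule-based planner in between.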

What problems does CloudSight solve?

We built CamFind to address the difference in search behavior for mobile users versus the traditional desktop search experience (long queries, pogo-sticking through search results for research, etc.). What we realized in building CamFind, through the tremendous demand we received from the many Fortune 500 companies that now use our platform, was that there was a greater need for visual cognition and understanding. We revisited our simple CamFind API and rebranded and re-released it as CloudSight, which has now become the flagship for the company. CloudSight focuses on either fine-grained object recognition and captioning or whole-scene understanding and captioning, whichever the customer chooses when configuring our API.
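For a sense of how such a choice might be exposed to a developer, here is a minimal sketch of a request to an image-captioning API. The endpoint, field names, and response shape are illustrative assumptions, not CloudSight's documented interface; consult their API reference for the real details:

```python
# Illustrative sketch only: the endpoint, field names, and response shape
# below are assumptions for the sake of example, not CloudSight's documented
# API. Consult the official API reference for the real interface.
import requests

API_KEY = "your-api-key"                                 # hypothetical credential
ENDPOINT = "https://api.cloudsight.example/v1/images"    # hypothetical URL

def describe_image(path, mode="scene"):
    """Submit an image and return a natural-language caption.

    `mode` stands in for the fine-grained-object vs. whole-scene choice
    mentioned above; the real parameter name may differ.
    """
    with open(path, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"image": f},
            data={"mode": mode},
            timeout=30,
        )
    response.raise_for_status()
    return response.json().get("caption")

# Example usage:
# print(describe_image("mug.jpg", mode="object"))
```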

What developments can we expect to see in computer vision in the next 5 years?

We’re particularly excited about the idea of visual cognition and understanding in artificial intelligence. In particular, what happens when a computer “sees” something? Does it actually understand, cognitively, what it’s looking at, to a degree that it can explain or reason about what it sees? In the current state of the art, computers can recognize features, focus attention on something, and assemble the appropriate language to “explain” imagery, but that’s about where it stops: computers don’t quite understand anything beyond the image.

Show a Starbucks coffee mug to classic computer vision, for example, and it will say something like “Starbucks”: it has no idea it’s a coffee mug, just that the Starbucks logo exists in the image. With the state of the art, we can understand that it’s a coffee mug, that it’s of the Starbucks brand, and maybe that it’s on a table or in a coffee shop. However, if we break the mug, shattering it to pieces, cognition would understand this broken state and explain it as a broken ceramic Starbucks coffee mug, having only learned how things break or are repaired outside of the coffee-mug domain. It’s this level of cognition that we think will truly make AI something that we’ve only previously dreamt about.

Outside of your own field, what area of deep learning advancements excites you most?

I think the finance and medical fields can benefit tremendously from deep learning. In both of these areas, specialists spend many, many hours learning to recognize particular patterns (both visually and extra-visually), and that’s exactly where deep learning really excels. Imagine the kind of cognitive space an Emergency Room doctor would have if they weren’t bothered with the noise of monitoring so many things at once: their minds would be free to work on higher-level problems, especially creatively, without being bogged down by tasks that a neural network can handle.

The same thing goes for finance: traders often call upon their intuition, but what is intuition, really? In my experience, what we call intuition is just the periphery of features and data that we cannot focus on, but that our minds continue to track unconsciously. Neural networks can track infinitely more than we can, even in our periphery, and so with these tools it’s possible to rapidly improve upon existing, incumbent technologies.

To hear more from Brad, as well as Tony Jebara from Netflix, Danny Lange from Uber, Andrew Tulloch from Facebook, Andrej Karpathy from OpenAI and Ofir Nachum from Google, register now for the Deep Learning Summit, San Francisco. View the full agenda here. Places are now very limited!

The Deep Learning Summit will also be running alongside the Virtual Assistant Summit, 26-27 January in San Francisco, meaning attendees can enjoy additional sessions and networking opportunities. Also, check out our article on 'How Deep Learning is Expected to Develop in 2017'.