Dr. Shalini Ghosh, Principal Scientist, Samsung Research America

Principal Scientist & Leader of the ML Research Team
Visual Display Intelligence Lab (Samsung SmartTV Division)

Most of us make extensive use of our five senses while interacting with the world around us. We use smell and taste to identify delicious food, we use words and touch to reassure a crying child, and we use the sights and sounds of a trailer to decide which movie we would like to watch. Whether we are working, playing or relaxing, our interactions with the world around us use multiple senses to capture and process signals from multiple modalities, e.g., vision, audio, language.

Artificial Intelligence (AI) has often taken inspiration from human learning to take big leaps forward. One such important stride is multimodal AI, where the AI agent processes video, images, audio, language, etc., and jointly learns from these modalities how to perform a task. For example, consider an AI agent in your Smart TV that wants to recommend movies based on your interests and what you're currently watching. Content recommendation systems of this kind have been used in applications like video sharing sites (e.g., YouTube) and online shopping sites (e.g., Amazon) [1, 2].

Let’s take the example of a Smart TV, where we are trying to understand why a user likes a show so that we can recommend similar content. The user may be interested in a TV show because of the lovely scenic background of a shot or an exciting car chase scene -- but the AI agent cannot be certain unless it gets explicit feedback from the user, which is often not available while the show is running. So the multimodal Machine Learning (ML) model has to analyze the video scenes along with the audio track and the language (e.g., closed-caption text) to gain a better understanding of the video. For instance, the image of a crowd, the audio of loud music, and closed captions containing lyrics together indicate that the scene is taken from a concert. If the AI agent sees that a user consistently watches concert or car chase scenes, it can recommend similar videos to the user [3].
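To make this concrete, here is a minimal sketch of late-fusion scene classification, assuming the video frames, audio track, and closed captions have already been encoded into fixed-size embeddings by pretrained encoders. The dimensions, scene labels, and network layout are illustrative choices, not the actual model behind our products.

```python
# Minimal late-fusion sketch: each modality embedding is projected into a shared
# space, concatenated, and classified into an (illustrative) set of scene types.
import torch
import torch.nn as nn

class LateFusionSceneClassifier(nn.Module):
    def __init__(self, video_dim=512, audio_dim=128, text_dim=300, num_scenes=4):
        super().__init__()
        # Project each modality into a common 256-dimensional space before fusing.
        self.video_proj = nn.Linear(video_dim, 256)
        self.audio_proj = nn.Linear(audio_dim, 256)
        self.text_proj = nn.Linear(text_dim, 256)
        # Classify the fused representation into scene types,
        # e.g., "concert", "car chase", "dialogue", "scenery".
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(3 * 256, num_scenes),
        )

    def forward(self, video_emb, audio_emb, text_emb):
        fused = torch.cat(
            [self.video_proj(video_emb),
             self.audio_proj(audio_emb),
             self.text_proj(text_emb)],
            dim=-1,
        )
        return self.classifier(fused)

# Toy usage with random tensors standing in for real encoder outputs.
model = LateFusionSceneClassifier()
scene_logits = model(torch.randn(1, 512), torch.randn(1, 128), torch.randn(1, 300))
print(scene_logits.softmax(dim=-1))  # probabilities over the illustrative scene types
```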

Processing and understanding multiple modalities also makes the AI agent more resilient to errors. For example, one important task of an AI agent is understanding the sentiment of a video, so that we know whether the mood is happy, sad, sombre or exciting. Sometimes the sentiment is not obvious from the text of a script, since the text may not capture all the nuances -- processing the audio track to understand intonation, and processing the video to see the actors' expressions, can resolve such ambiguities and make the AI agent more precise in sentiment understanding [4].
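As a toy illustration (not the model from [4]), the snippet below fuses per-modality sentiment scores with a weighted average, so that a sarcastic intonation and a frowning expression can override an ambiguous line of text; the scores and weights are made-up numbers.

```python
# Illustrative weighted late fusion of per-modality sentiment scores in [-1, 1].
def fuse_sentiment(scores, weights):
    """Return the weighted average of the per-modality sentiment scores."""
    total_weight = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight

# The script text alone is ambiguous ("Oh, great."), but the sarcastic intonation
# and the actor's expression push the fused sentiment clearly negative.
modality_scores = {"text": 0.1, "audio": -0.7, "video": -0.6}
modality_weights = {"text": 1.0, "audio": 1.5, "video": 1.5}

print(fuse_sentiment(modality_scores, modality_weights))  # about -0.46, i.e., negative
```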

Our AI/ML research group at Samsung Research America, along with other research groups around the world, is busy solving the important research challenges that come up in multimodal AI. We focus on all aspects of multimodal AI -- here are some of the interesting problems we have worked on recently:

1. Visual Dialog: We work on multimodal ML systems that can understand a scene and hold a dialog with a user about questions related to the image [5]. This can be very important for visually impaired users, since it gives them a way to understand images they cannot see by interacting with the AI agent through natural language dialog.

2. Object Detection: We trained ML-based fusion models [6] that can accurately recognize objects in the environment by fusing knowledge from computer vision and natural language models.

3. Incremental Learning: Just as children learn incrementally over time with the help of their different senses (modalities), we train ML models that can learn new concepts incrementally and add them to the existing knowledge of the multimodal AI agent [7, 8]. This lets the AI agent learn new concepts without retraining the whole model.

4. Compression: Multimodal AI models can be more accurate, but sometimes that comes at the cost of larger models. Our research also focuses on model compression, which makes ML models compact without losing accuracy [9]. This enables the models to run efficiently on mobile devices in terms of latency and power.

5. Explainability: We design multimodal ML models that explain their predictions effectively, which makes them more interpretable [10]. For example, when analyzing images for a particular task, we can identify the salient parts of the image that the model relied on to solve that task; a minimal sketch of this idea follows this list.
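The sketch below shows generic input-gradient saliency, one simple way to highlight which image regions influenced a prediction. It is not the HINT method of [10]; the randomly initialized ResNet-18 and the random "image" are placeholders for illustration only.

```python
# Input-gradient saliency sketch: the gradient of the winning class score with
# respect to the input pixels highlights the regions that drove the prediction.
import torch
import torchvision.models as models

model = models.resnet18()  # randomly initialized weights keep the sketch offline
model.eval()

image = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in for a real frame
logits = model(image)
predicted_class = logits.argmax(dim=1).item()

# Backpropagate the score of the predicted class down to the input pixels.
logits[0, predicted_class].backward()
saliency = image.grad.abs().max(dim=1).values  # per-pixel importance, shape (1, 224, 224)
print(saliency.shape)
```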

We are excited to be working in multimodal AI -- it is a fast-evolving and rich area of AI and ML research, with the potential for far-reaching impact in several important applications, e.g., smart devices, assistive technology, healthcare and robotics.


References

[1] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, and Dasarathi Sampath. “The YouTube video recommendation system”. In Proceedings of the fourth ACM conference on Recommender systems (RecSys '10), 2010.

[2] G. Linden, B. Smith and J. York. "Amazon.com recommendations: item-to-item collaborative filtering". In IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan.-Feb. 2003.

[3] Bo Yang, Tao Mei, Xian-Sheng Hua, Linjun Yang, Shi-Qiang Yang, and Mingjing Li. “Online video recommendation based on multimodal fusion and relevance feedback”. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR '07), pages 73-80, 2007.

[4] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. “Towards multimodal sentiment analysis: harvesting opinions from the web”. In Proceedings of the 13th international conference on multimodal interfaces (ICMI '11), 2011.

[5] Heming Zhang, Shalini Ghosh, Larry P. Heck, Stephen Walsh, Junting Zhang, Jie Zhang, C.-C. Jay Kuo. “Generative Visual Dialogue System via Weighted Likelihood Estimation”. In IJCAI, pages 1025-1031, 2019.

[6] M. Ehatisham-Ul-Haq et al. "Robust Human Activity Recognition Using Multimodal Feature-Level Fusion". In IEEE Access, vol. 7, pp. 60736-60751, 2019.

[7] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry P. Heck, Heming Zhang, C.-C. Jay Kuo. “Class-incremental Learning via Deep Model Consolidation”. In CVPR Workshop on Visual Understanding by Learning from Web Data (WebVision), 2019.

[8] Dawei Li, Serafettin Tasci, Shalini Ghosh, Jingwen Zhu, Junting Zhang, Larry Heck, “IMOD: An Efficient Incremental Learning System for Mobile Object Detection”. In Proceedings of the ACM/IEEE Symposium on Edge Computing (SEC), 2019.

[9] Jie Zhang, Junting Zhang, Shalini Ghosh, Dawei Li, Jingwen Zhu, Heming Zhang, Yalin Wang. “Regularize, Expand and Compress: Multi-task based Lifelong Learning via NonExpansive AutoML”. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.

[10] Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Dhruv Batra & Devi Parikh. “Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded”. Proceedings of the International Conference on Computer Vision (ICCV), 2019 (Earlier version presented at the ICLR Debugging Machine Learning Models Workshop, 2019).