Here is a scenario: you register for a webinar but cannot attend it. You get the recording, and it languishes in your inbox because you just can't seem to find the time to watch an hour-long video. Sound all too familiar?

How about product reviews? You want to check out the features of the all-new MacBook, but you have to scrub through the entire video to find the information on hardware specs and clock speed, because you don't really care about the cosmetic updates.

Have you ever felt excited about watching a conference session when you don't know where the topics you are looking for are even mentioned? Most likely not.

Videos are an incredible medium for disseminating and consuming information in enterprises, and yet hardly anyone makes effective use of them. We've built processes and techniques for and around video creation, distribution, and measurement, but somehow the whole point of a video - its consumption - has fallen through the cracks.

Most people do not have the patience to watch an informational video beyond the first few minutes. So webinars, conference recordings, reviews, tutorials and the like, however good they may be, simply do not live up to their full potential, and very rarely does anyone watch them in their entirety.

And this is the idea behind VideoKen - making videos richer and more consumable. We took our inspiration from textbooks, which people rarely read from cover to cover. Readers look at the table of contents at the beginning and/or the index at the end to identify the key topics covered in the book, and then read the sections covering those topics. Videos do not come with these indices, so we apply AI to create them automatically. The Table of Contents (ToC) and Phrase Cloud that VideoKen generates summarize the key topics covered and let viewers navigate directly to the corresponding points in the video. In addition, VideoKen creates a full transcript of the video and provides search over it.
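
To make that concrete, here is an illustrative sketch (in Python) of what such an index might look like once the ToC, Phrase Cloud, and time-aligned transcript are assembled. The field names and values below are hypothetical, chosen only to show the structure, not VideoKen's actual data format.

```python
# An illustrative sketch of a generated video index combining the ToC,
# Phrase Cloud, and transcript. All field names and values are hypothetical.
video_index = {
    "toc": [                                   # jump points derived from slide titles
        {"title": "Introduction", "start": 0},
        {"title": "Model Architecture", "start": 412},
    ],
    "phrase_cloud": [                          # key concepts with their mentions (seconds)
        {"phrase": "speech recognition", "occurrences": [95, 618, 1204]},
    ],
    "transcript": [                            # time-aligned segments enable search
        {"start": 95, "end": 101, "text": "We start with speech recognition ..."},
    ],
}

# Search then reduces to scanning the time-aligned transcript segments.
hits = [seg for seg in video_index["transcript"] if "speech" in seg["text"].lower()]
```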

We start with automatic speech recognition, using an engine built on the Deep Speech 2 architecture. A custom LSTM-based model then restores punctuation to produce a complete, readable transcript. The goal of the Phrase Cloud algorithm is to automatically select the principal phrases, ideally corresponding to key concepts, from the thousands of words appearing in the transcript.
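
As an illustration of the punctuation step, here is a minimal PyTorch sketch of an LSTM-based tagger that labels each transcript word with the punctuation that should follow it. The tag set, vocabulary size, and layer dimensions are assumptions made for the example, not the actual VideoKen model.

```python
# A minimal sketch of an LSTM-based punctuation restoration model, assuming the
# ASR stage has already produced an unpunctuated word sequence. Dimensions and
# the tag set are illustrative only.
import torch
import torch.nn as nn

PUNCT_TAGS = ["O", "COMMA", "PERIOD", "QUESTION"]  # punctuation predicted after each word

class PunctuationTagger(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A bidirectional LSTM reads the whole utterance before tagging each token.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, len(PUNCT_TAGS))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> logits: (batch, seq_len, num_tags)
        embedded = self.embed(token_ids)
        outputs, _ = self.lstm(embedded)
        return self.classifier(outputs)

# Example: tag a single toy utterance of 6 word ids from a 10k-word vocabulary.
model = PunctuationTagger(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (1, 6)))
predicted_tags = logits.argmax(dim=-1)  # one punctuation decision per word
```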

The first step in this process is a custom phrase extraction algorithm. The extracted phrases are ranked using statistical analysis, assisted by a knowledge base, and clustered based on the domain associated with each of them. Lower-ranked phrases in each cluster are then pruned to arrive at the final set of phrases displayed in the Phrase Cloud. The timing information from the transcript is used to map these key phrases to the video timeline.
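
The sketch below illustrates this rank-cluster-prune flow on toy data. The scoring heuristic, the dictionary standing in for the knowledge base, and the per-cluster cutoff are simplifications chosen for readability, not the statistical models used in production.

```python
# A simplified sketch of the phrase ranking / clustering / pruning flow.
# The score, the domain lookup, and the thresholds are illustrative stand-ins.
from collections import defaultdict

def rank_and_prune(candidates, domain_of, keep_per_cluster=3):
    """candidates: list of (phrase, count, first_timestamp_seconds)."""
    # 1. Rank: a toy score favoring frequent, multi-word phrases.
    scored = [(count * len(phrase.split()), phrase, ts) for phrase, count, ts in candidates]

    # 2. Cluster by domain using a knowledge-base lookup (here, a plain dict).
    clusters = defaultdict(list)
    for score, phrase, ts in scored:
        clusters[domain_of.get(phrase, "general")].append((score, phrase, ts))

    # 3. Prune lower-ranked phrases within each cluster.
    phrase_cloud = []
    for domain, items in clusters.items():
        for score, phrase, ts in sorted(items, reverse=True)[:keep_per_cluster]:
            phrase_cloud.append({"phrase": phrase, "domain": domain, "jump_to": ts})
    return phrase_cloud

candidates = [("convolutional neural network", 12, 95.0),
              ("learning rate", 7, 310.5),
              ("coffee break", 2, 20.0)]
domain_of = {"convolutional neural network": "deep learning", "learning rate": "deep learning"}
print(rank_and_prune(candidates, domain_of))
```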

The algorithm for Table of Contents generation uses visual data in the form of slides or other typed text appearing in the video. We use a custom-trained RCNN (region-based convolutional neural network) model to detect region boundaries in a video frame. Currently, we detect region types including screen content, the human body, and the surrounding environment (e.g., a lecture hall). For the screen content, we use a visual text model to identify frames corresponding to slide transition points, classify video frames into categories such as slide, demo, or handwritten text, and identify the text frames over which OCR should be run. We use domain data to correct OCR errors. A visual saliency analysis helps identify candidate titles, and we obtain the final set of ToC entries after a global postprocessing step that eliminates near-duplicate and less important titles.
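
To give a flavor of that final postprocessing step, here is a small sketch that collapses near-duplicate slide titles (the same slide re-detected across consecutive frames) into single ToC entries. The input format and the similarity threshold are illustrative assumptions; the upstream region detection, frame classification, and OCR stages are not shown.

```python
# A minimal sketch of ToC postprocessing: collapsing near-duplicate slide titles
# into single chapter entries. The 0.85 threshold and input format are assumptions.
from difflib import SequenceMatcher

def build_toc(detected_titles, similarity_threshold=0.85):
    """detected_titles: list of (timestamp_seconds, ocr_title), in video order."""
    toc = []
    for timestamp, title in detected_titles:
        if toc:
            previous = toc[-1]["title"]
            # Skip titles that are near-duplicates of the previous entry.
            if SequenceMatcher(None, previous.lower(), title.lower()).ratio() >= similarity_threshold:
                continue
        toc.append({"title": title.strip(), "start": timestamp})
    return toc

detected = [(12.0, "Agenda"),
            (95.5, "Deep Speech 2 Architecture"),
            (101.0, "Deep Speech 2 architecture"),   # OCR re-read of the same slide
            (340.0, "Results and Benchmarks")]
print(build_toc(detected))
```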

The ultimate purpose of any piece of content is for the consumer to benefit from it, one way or another. Information has to be presented in a way that is clearly understood and easy to find. VideoKen nudges video consumption in this direction, transforming how we view and interact with videos.

Here is a quick example of how VideoKen works, applied to a video from a RE•WORK conference.

The entire topic list is now neatly summarized in the ToC and the Phrase Cloud, which makes for a much friendlier viewing experience. But it's not just about aesthetics: our data shows that this form of video viewing increases the amount of video watched by 2-4X. Viewers are far more invested in and engaged with a video once they know its contents and exactly where to find them.

Enterprises produce and use videos for a variety of purposes - from webinars, sales enablement, product training, and demos to events, marketing, L&D, and internal communications. When videos are such a big part of an enterprise's business strategy, why diminish their impact by failing to extract their full potential?