Author: Laura Palacio Garcia, Data Scientist, LiveSmart
Nowadays, there are many apps that allow users to track their diet and determine nutritional information about their food. However, most of these apps involve manually typing every single food you eat for every meal, which can be very tedious for the user.
At LiveSmart, we wanted to build a model that would let users upload a photo of their meal and would then identify the individual food items on the plate.
We investigated existing algorithms for image classification, but nothing fit our requirements exactly. Common models either could not identify multiple food objects on a single plate, were trained on many extra categories unrelated to food, or did not perform well enough for our purposes.
Therefore, we needed a two-step approach to achieve multi-object classification: first detect the food items through object detection, then identify each of them via image classification.
Since we needed to detect objects and classify images, we needed two supervised learning models. Supervised learning is a method that uses labelled training data to train a function that can then generalize to new examples.
In order to perform these tasks, we retrained already-existing models called YOLO v3 and Inception v3 (used for object detection and image classification, respectively) by providing them with labelled objects and food images. This technique is called transfer learning.
Both of these models were chosen because they have been shown to perform well on large image datasets (14 million images from ImageNet) and have well-tuned neural network architectures and parameter choices. Both can have their classification layer retrained on our own food image dataset, which meant we could combine the feature extraction capabilities of these models (which had been trained for hundreds of hours on GPUs) with our own bespoke dataset and requirements (22 food categories, rather than the 20,000 general categories in ImageNet).
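To make the idea of retraining only the classification layer concrete, here is a minimal sketch in plain Python. It is not our production code and does not use Inception v3: the function frozen_features is a hypothetical stand-in for a pretrained feature extractor, and the tiny "images" are made-up lists of pixel intensities. The point it illustrates is that the feature extractor is never updated, while a new softmax layer is trained on top of its outputs.

```python
import math

def frozen_features(image):
    """Hypothetical stand-in for a pretrained feature extractor (never updated)."""
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return [mean, spread, 1.0]  # the trailing 1.0 acts as a bias term

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, features):
    return [sum(w * v for w, v in zip(row, features)) for row in weights]

def train_head(images, labels, num_classes, epochs=200, lr=0.5):
    """Train only the new classification layer with plain SGD on cross-entropy."""
    feats = [frozen_features(img) for img in images]  # computed once, then fixed
    dim = len(feats[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            probs = softmax(class_scores(weights, x))
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights

def predict(weights, image):
    scores = class_scores(weights, frozen_features(image))
    return scores.index(max(scores))

# Made-up data: class 0 = dark images, class 1 = bright images.
train_images = [[0.1, 0.2, 0.1, 0.2], [0.2, 0.1, 0.15, 0.1],
                [0.9, 0.8, 0.9, 0.85], [0.8, 0.9, 0.95, 0.9]]
train_labels = [0, 0, 1, 1]
head = train_head(train_images, train_labels, num_classes=2)
```

In a real setup the frozen part would be the pretrained network up to its final layer, and the new head would have one output per food category.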
Food image dataset
To retrain these models, we built a custom training dataset by downloading 35,000 food images from various sources, including Google Images and ImageNet. We chose 22 food categories to fit our business needs, which gave us over a thousand images per category.
The performance of our model and its ability to generalize to new images would depend largely on the number of images we had and how diverse every category was (for instance, we needed to have enough images of all types of fruit in the fruit category).
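A quick way to keep an eye on dataset size per category is to count the labels and flag any category that falls below a chosen threshold. The sketch below is illustrative only: the function name, the threshold and the example labels are hypothetical, and only the totals of 35,000 images and 22 categories come from the text above.

```python
from collections import Counter

def under_represented(labels, min_per_category):
    """Return the categories that have fewer than min_per_category images."""
    counts = Counter(labels)
    return sorted(cat for cat, n in counts.items() if n < min_per_category)

# Average number of images per category for the dataset described above.
average_per_category = 35_000 / 22  # roughly 1,590

# Hypothetical toy labels, one entry per image.
example_labels = ["fruit"] * 5 + ["bread"] * 2 + ["rice"] * 3
too_small = under_represented(example_labels, min_per_category=3)
```

A count like this only covers quantity; diversity within a category (such as different types of fruit) still has to be checked by looking at the images.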
Then, we prepared the training dataset by validating and adding labels to the single food images. In addition to this, we manually created bounding boxes for images with multiple food items.
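As an illustration of what a bounding box annotation can look like, the sketch below converts a box drawn in pixel coordinates into the normalized centre, width and height values used by the Darknet label format that YOLO is commonly trained with. That this exact format was used for our labels is an assumption made for the example, and the function name and sample numbers are hypothetical.

```python
def to_yolo_box(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert pixel corner coordinates to (x_centre, y_centre, width, height),
    each expressed as a fraction of the image size."""
    x_centre = (x_min + x_max) / 2 / img_w
    y_centre = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_centre, y_centre, width, height

# A hypothetical box drawn around one food item in a 400 x 500 pixel photo.
box = to_yolo_box(100, 50, 300, 250, img_w=400, img_h=500)
```

Normalizing by the image size keeps the annotation valid if the photo is later resized.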
Finally, retraining was done on high-performance GPUs.
Object detection and image classification
In order to build our food recognition model, we used a state-of-the-art model called YOLO v3 to perform our object detection.
YOLO applies a single neural network to the full image, making it relatively fast. This network divides the image into regions and predicts bounding boxes for each region. Then, we used these bounding boxes to separate the objects on the plate, which were then separately classified using our retrained Inception v3 model.
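The two-step flow described above can be sketched as follows. This is a simplified illustration rather than our actual pipeline: the image is a plain list of pixel rows, and fake_detect and fake_classify are hypothetical stubs standing in for the retrained YOLO v3 and Inception v3 models.

```python
def crop(image, box):
    """Cut a region out of an image stored as a list of pixel rows.
    box is (x_min, y_min, x_max, y_max) in pixels, with the max edges exclusive."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]

def recognise_plate(image, detect, classify):
    """Step 1: detect bounding boxes. Step 2: classify each cropped region."""
    return [(box, classify(crop(image, box))) for box in detect(image)]

# Hypothetical stubs standing in for the object detector and the classifier.
def fake_detect(image):
    return [(0, 0, 2, 2), (2, 0, 4, 2)]

def fake_classify(region):
    pixels = [p for row in region for p in row]
    return "rice" if sum(pixels) / len(pixels) > 0.5 else "beans"

plate = [[0.9, 0.9, 0.1, 0.1],
         [0.9, 0.9, 0.1, 0.1],
         [0.5, 0.5, 0.5, 0.5],
         [0.5, 0.5, 0.5, 0.5]]
items = recognise_plate(plate, fake_detect, fake_classify)
```

Passing the detector and classifier in as functions keeps the two steps independent, so either model can be retrained or swapped without touching the other.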
Finally, we tested the performance of our Inception model using 4-fold cross-validation for 200K iterations. Cross-validation is a technique used in machine learning to estimate how well a model will generalize to an independent data set.
In order to do that, we split our entire dataset into training, validation and testing datasets. Then, the procedure was repeated k times (where k = 4 in this case), yielding 4 random partitions of the original sample. This allowed us to train and validate the model with different images in every fold.
The validation dataset was used to fine-tune the model hyperparameters (such as the learning rate) by choosing the values that minimized the error on the validation set. Then, the model was trained on the full training set using the chosen hyperparameters, and the error on the test set was recorded. After performing the 4-fold cross-validation, the average test accuracy of the image classification model was 75.8%.
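The cross-validation procedure described above could look roughly like the sketch below. Several details are assumptions made for illustration: the way each fold is assigned to training, validation and testing, the reading of "full training set" as training plus validation images, and the hypothetical callback train_and_score(learning_rate, train_idx, eval_idx), which is expected to train a model and return its accuracy on the evaluation images.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield k (train, validation, test) index splits over a shuffled dataset."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_fold = (i + 1) % k
        test = folds[i]
        val = folds[val_fold]
        train = [idx for j, fold in enumerate(folds)
                 if j not in (i, val_fold) for idx in fold]
        yield train, val, test

def cross_validate(n_items, k, learning_rates, train_and_score):
    """Pick the best learning rate on the validation set in each fold,
    then record the test accuracy and return the average over all folds."""
    test_scores = []
    for train, val, test in k_fold_splits(n_items, k):
        best_lr = max(learning_rates,
                      key=lambda lr: train_and_score(lr, train, val))
        # Assumption: retrain on training plus validation images before testing.
        test_scores.append(train_and_score(best_lr, train + val, test))
    return sum(test_scores) / len(test_scores)
```

Because the test images of each fold are never used to choose the hyperparameters, the averaged test accuracy is a fairer estimate of how the model behaves on new photos.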
In conclusion, we built a model that performed object detection and image classification of a food plate image. After identifying the ingredients of the plate, we also provided the user with meaningful information such as nutritional content and colour diversity of the plate.
Interested in learning more? Hear from Laura Palacio about her work at LiveSmart at the AI Assistant Summit in London this September 19 - 20.
Author Bio: Laura is a data scientist at LiveSmart, a health startup that provides an integrated solution to empower employees to optimise physical and mental wellbeing at work. She studied Biomedical Engineering in Barcelona, Spain and then did her MSc in Human and Biological Robotics at Imperial College London before starting to work for LiveSmart. She has previously worked as a data scientist in a biotechnology institute and a tech startup.