This repository contains a presentation and code demonstration of Food Recognition using Vision Transformer (ViT) with OpenAI's GPT-3.5 Chatbot.
Presenter: Yitian (Ewan) Long
- Overview
- Model Card & Dataset Card
- Interactive Demonstration
- Code Demonstration
- Food Recognition with Chatbot
- Critical Analysis
- Resources & Citations
Overview of Vision Transformer Classifier for Food Recognition with Chatbot
Traditional image recognition models, such as CNNs (Convolutional Neural Networks), recognize relatively general categories. For example, in the past, Apple's iPhone Photos could accurately identify apples, bananas, steaks, etc., using a CNN model. However, for more specific food categories, such as apple pie, chicken curry, and kung pao chicken, it might not have been able to identify them precisely.
This is mainly because the differences between these food categories are very small, and traditional CNN models struggle to extract effective features for differentiation. In addition, traditional CNN models perform poorly when dealing with long-distance dependencies, which makes it more difficult to extract features for these fine-grained food categories.
That's why we are introducing the Vision Transformer (ViT) model, which is more suitable for handling long-distance dependencies and extracting global features through self-attention mechanisms.
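To make that intuition concrete, here is a minimal single-head self-attention sketch in PyTorch; the tensor sizes are illustrative (196 patch tokens of dimension 768, as in ViT-Base), not the project's actual code:

```python
import torch

# 196 patch tokens of dimension 768, as in ViT-Base at 224x224 with 16x16 patches.
tokens = torch.randn(1, 196, 768)
wq, wk, wv = (torch.nn.Linear(768, 768) for _ in range(3))

q, k, v = wq(tokens), wk(tokens), wv(tokens)
scores = (q @ k.transpose(-2, -1)) / (768 ** 0.5)  # (1, 196, 196) pairwise scores
out = scores.softmax(dim=-1) @ v                   # every patch mixes with every other

print(out.shape)  # torch.Size([1, 196, 768])
```

Because every patch attends to every other patch in a single step, distant image regions interact directly, which is exactly the long-distance behavior that CNNs struggle to capture.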
- Self-attention-based architectures, such as Transformers, have become the model of choice in natural language processing (NLP). However, convolutional neural networks (CNNs) remain dominant in computer vision tasks (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016).
- Google's research team applied a standard Transformer directly to images with the fewest possible modifications. By dividing the image into fixed-size patches and providing the sequence of linear embeddings of these patches as input to a Transformer, images are processed the same way as tokens (words) in an NLP application, and the model is trained on image classification in a supervised fashion (see the patch-embedding sketch after this list).
- When trained on large datasets (14M-300M images), they find that large-scale training trumps inductive bias.
- When pre-trained on the public ImageNet-21k dataset, the ViT model approaches or surpasses the state of the art on multiple image recognition benchmarks.
- We aim to fine-tune the ViT model for food recognition so that it can accurately and effectively identify more specific food categories.
- Google's ViT model has been proven effective in image recognition tasks, so in this project we fine-tuned it on a food image classification dataset to achieve high accuracy.
- The food image classification dataset provides a solid foundation for fine-tuning a robust and precise image classification model.
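Below is a minimal PyTorch sketch of the patch-embedding step referenced above (shapes follow the 224x224 resolution and 16x16 patches used by the model; an illustrative simplification, not the library's internal code):

```python
import torch

# One RGB image at ViT's 224x224 input resolution (random pixels for illustration).
image = torch.randn(1, 3, 224, 224)
patch_size, embed_dim = 16, 768

# Cut the image into non-overlapping 16x16 patches:
# (1, 3, 224, 224) -> (1, 3, 14, 14, 16, 16) -> (1, 196, 3*16*16)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size ** 2)

# Linearly embed each flattened patch, exactly like embedding a token in NLP.
projection = torch.nn.Linear(3 * patch_size ** 2, embed_dim)
tokens = projection(patches)

print(tokens.shape)  # torch.Size([1, 196, 768]) -- a "sentence" of 196 patch tokens
```

The resulting (1, 196, 768) tensor is the sequence of patch tokens that the Transformer encoder consumes.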
- Model Name: Food Type Image Detection Vision Transformer
- Original Model: Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224.
- It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository.
- Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded.
- Does not provide any fine-tuned heads, as these were zeroed by Google researchers.
- Model Type: Image Classification
- Model Architecture: Vision Transformer (ViT)
- Fine-tuning:
- Fine-tuned on the Food Image Classification Dataset using 12 of its 35 varieties
- When attempting to train on the entire dataset with Google Colab Pro, the session automatically disconnects from Google's servers about five hours into training, losing the uploaded dataset, all training progress, and all installed packages.
- Optimizer: AdamW
- Epochs: 20
- Model Performance: Achieved an accuracy of 96.23% across all classes of the Food Image Classification Dataset used for fine-tuning (see the sketch below)
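For reference, a fine-tune with these settings can be set up roughly as follows with 🤗 Transformers (following the pattern of the fine-tuning guide listed under Resources). The data directory, label folder layout, learning rate, and batch size are illustrative assumptions:

```python
import torch
from datasets import load_dataset
from transformers import (ViTForImageClassification, ViTImageProcessor,
                          Trainer, TrainingArguments)

# Hypothetical local layout: food_images/<label>/<image>.jpg, restricted
# to the 12 varieties used for fine-tuning.
ds = load_dataset("imagefolder", data_dir="food_images", split="train")
ds = ds.train_test_split(test_size=0.1)
labels = ds["train"].features["label"].names

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

def transform(batch):
    # Resize and normalize the PIL images into ViT pixel values.
    inputs = processor(images=batch["image"], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

ds = ds.with_transform(transform)

def collate_fn(batch):
    return {"pixel_values": torch.stack([x["pixel_values"] for x in batch]),
            "labels": torch.tensor([x["labels"] for x in batch])}

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",  # ImageNet-21k pre-trained backbone
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

args = TrainingArguments(
    output_dir="vit-food",
    num_train_epochs=20,             # epochs used in this project
    learning_rate=2e-5,              # assumed value; not stated in the card
    per_device_train_batch_size=16,  # assumed value
    remove_unused_columns=False,     # keep the "image" column for the transform
)

# Trainer's default optimizer is AdamW, matching the card above.
trainer = Trainer(model=model, args=args, data_collator=collate_fn,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
```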
- Dataset Name: 'Food Image Classification Dataset'
- Dataset Size: 24K unique images
- Dataset Description: 35 varieties of both Indian and Western appetizers
- Dataset Source: Unique images collected from various Google resources by Kaggle user Harish Kumar
- Dataset License: CC0: Public Domain
- Dataset Quality:
- Meticulously curated images ensuring diversity and representativeness
- Provides a solid foundation for developing robust and precise image classification algorithms
- Encourages exploration in the fascinating field of food image classification
- Expected Update Frequency: Quarterly
You can interact with the model through our Hugging Face Spaces platform. The intuitive interactive demonstration allows you to upload a food image or use the provided examples, then get the model's prediction of the food category.
Food Recognition Demonstration
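A Space like this can be wired up with a few lines of Gradio; here is a minimal sketch, assuming a hypothetical Hub ID for the fine-tuned checkpoint:

```python
import gradio as gr
from transformers import pipeline

# Hypothetical Hub ID; substitute the actual fine-tuned checkpoint.
classifier = pipeline("image-classification", model="your-username/vit-food")

def predict(image):
    # Map the pipeline's output to {label: score} for Gradio's Label component.
    return {p["label"]: p["score"] for p in classifier(image)}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=3),
    title="Food Recognition with ViT",
)

demo.launch()
```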
Food Recognition Code Demonstration
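For reference, a bare-bones version of the classification step could look like the following; the checkpoint ID and image path are placeholders:

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

ckpt = "your-username/vit-food"  # hypothetical checkpoint ID
model = ViTForImageClassification.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)

image = Image.open("samples/apple_pie.jpg")  # placeholder path to a food photo
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)

probs = logits.softmax(dim=-1)[0]
pred = int(probs.argmax())
print(model.config.id2label[pred], f"{probs[pred]:.3f}")
```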
Here, we've built a chatbot, so the project not only performs food classification with the Vision Transformer but also incorporates OpenAI's API to enable real-time interactive dialogue. You can ask anything you wish to know about the food you've uploaded, easily obtain the ingredients and steps needed to make it, and even recreate this delicious meal in your own kitchen tonight.
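Below is a minimal sketch of how the classifier and the chatbot can be wired together, using the legacy `openai` Python SDK's GPT-3.5 ChatCompletion interface; the checkpoint ID, prompt wording, and image path are illustrative assumptions rather than the project's exact code:

```python
import openai
from transformers import pipeline

openai.api_key = "YOUR_API_KEY"  # supplied by the user

# Hypothetical checkpoint ID; substitute the fine-tuned model.
classifier = pipeline("image-classification", model="your-username/vit-food")

def chat_about_food(image_path: str, question: str) -> str:
    """Classify the uploaded image, then ask GPT-3.5 about the predicted dish."""
    label = classifier(image_path)[0]["label"]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": f"The user uploaded a photo of {label}. Answer questions about this dish."},
            {"role": "user", "content": question},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(chat_about_food("samples/apple_pie.jpg", "What ingredients do I need to make this at home?"))
```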
- It can accurately identify more specific food categories than traditional CNN models.
- It can accurately identify food under different environments and lighting conditions. For example, it achieves highly accurate recognition even in dim lighting and complex scenes.
- As we used an open-source model, it is possible to retrain the model to recognize specific traditional and specialty cuisines, such as the traditional Scottish dish haggis or Beijing roast duck.
- The Food Recognition with Chatbot version can easily interact with OpenAI's ChatGPT to get more information about the food, such as its history, ingredients, and recipes.
- Extending the model to recognize more categories of food, such as desserts, main courses, and beverages.
- Extending it to recognize more categories, not just types of food but also types of plants, animals, and so on, making its functionality more diverse.
- Compressing the model into a smaller one and running it locally.
- Trying multimodal recognition by combining the model with other models, such as image generation models, to generate images of food based on your text description.
- Combining it with other vision transformer models (such as Swin Transformer) and with CNNs to improve the accuracy and efficiency of recognition.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).
- Google Research Vision Transformer
- Google ViT Model Pre-trained on ImageNet-21k
- Food Image Classification Dataset
- Fine-Tune ViT for Image Classification with 🤗 Transformers
- Food Recognition Benchmark Starter Kit