Violet is a vision-language model designed to generate high-quality Arabic image captions. Built around a Gemini Decoder and a pretrained transformer, Violet bridges the gap between computer vision and natural language processing (NLP) for Arabic. The repository provides a simple, effective, and streamlined pipeline that can handle a variety of image formats and produce descriptive captions in Arabic with minimal effort.
This repository is built on the model proposed in the paper *Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder*.
- Arabic Image Captioning: Generate high-quality captions for images in Arabic.
- Visual Feature Extraction: Extract image features for integration into vision-language models or downstream tasks.
- Mixed Input Support: Handle batches of images in various formats, such as URLs, file paths, NumPy arrays, PyTorch tensors, and PIL Image objects.
- Pretrained Model: Ships with robust pretrained weights, so no additional training is required.
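Mixed inputs ultimately all need to be normalized to RGB images before feature extraction. The sketch below shows how such normalization could look; `to_pil` is a hypothetical helper for illustration, not part of the Violet API:

```python
import io

import numpy as np
from PIL import Image


def to_pil(image):
    """Normalize one input (URL/path string, NumPy array, or PIL image) to an RGB PIL.Image.

    A PyTorch tensor (CHW) could be handled by permuting to HWC and calling .numpy() first.
    """
    if isinstance(image, Image.Image):
        return image.convert("RGB")
    if isinstance(image, np.ndarray):
        arr = image
        if arr.dtype != np.uint8:  # assume floats in [0, 1]
            arr = (arr * 255).clip(0, 255).astype(np.uint8)
        return Image.fromarray(arr).convert("RGB")
    if isinstance(image, str):
        if image.startswith(("http://", "https://")):
            import requests  # imported lazily, only needed for URL inputs
            return Image.open(io.BytesIO(requests.get(image).content)).convert("RGB")
        return Image.open(image).convert("RGB")
    raise TypeError(f"Unsupported image type: {type(image)!r}")
```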
Install directly from GitHub:

```bash
pip install git+https://github.com/Mahmood-Anaam/violet.git --quiet
```

Or clone the repository and install it in editable mode:

```bash
git clone https://github.com/Mahmood-Anaam/violet.git
cd violet
pip install -e .
```

Or set up a conda environment:

```bash
git clone https://github.com/Mahmood-Anaam/violet.git
cd violet
conda env create -f environment.yml
conda activate violet
pip install -e .
```
```python
import numpy as np
import torch
from PIL import Image

from violet.pipeline import VioletImageCaptioningPipeline
from violet.configuration import VioletConfig

# Initialize the pipeline
pipeline = VioletImageCaptioningPipeline(VioletConfig)

# Caption a single image
caption = pipeline("http://images.cocodataset.org/val2017/000000039769.jpg")
print(caption)

# Caption a batch of images in mixed formats
images = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",  # URL
    "/path/to/local/image.jpg",                                # file path
    np.random.rand(224, 224, 3),                               # NumPy array (HWC)
    torch.randn(3, 224, 224),                                  # PyTorch tensor (CHW)
    Image.open("/path/to/pil/image.jpg"),                      # PIL image
]
captions = pipeline(images)
for caption in captions:
    print(caption)
```
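For large collections, it can help to caption images in fixed-size chunks rather than one giant batch. A simple, generic helper (the chunk size and the commented `pipeline` usage are illustrative, not part of the Violet API):

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Illustrative usage with an initialized pipeline:
# for batch in chunked(images, 8):
#     for caption in pipeline(batch):
#         print(caption)
```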
If needed, extract visual features for further processing:
```python
# Extract features from a single image
features = pipeline.generate_features("http://images.cocodataset.org/val2017/000000039769.jpg")
print(features.shape)

# Extract features for a batch of images
batch_features = pipeline.generate_features(images)
print(batch_features.shape)
```
Generate captions from extracted visual features:
```python
captions = pipeline.generate_captions_from_features(features)
for caption in captions:
    print(caption)
```
Interactive Jupyter notebooks are provided to demonstrate Violet's capabilities. You can open these notebooks in Google Colab: