Preparing Data for YOLO-World

Overview

For pre-training YOLO-World, we adopt the datasets listed in the table below:

| Data | Samples | Type | Boxes |
| --- | --- | --- | --- |
| Objects365v1 | 609k | detection | 9,621k |
| GQA | 621k | grounding | 3,681k |
| Flickr | 149k | grounding | 641k |
| CC3M-Lite | 245k | image-text | 821k |

Dataset Directory

We place all datasets in the data directory, for example:

├── coco
│   ├── annotations
│   ├── lvis
│   ├── train2017
│   ├── val2017
├── flickr
│   ├── annotations
│   └── images
├── mixed_grounding
│   ├── annotations
│   ├── images
├── objects365v1
│   ├── annotations
│   ├── train
│   ├── val

NOTE: We strongly suggest that you check the directories or paths in the dataset part of the config file, especially for the values ann_file, data_root, and data_prefix.
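
The check suggested in the note can be automated. The following is a minimal sketch, not part of YOLO-World: `find_missing_paths` is a hypothetical helper, and it assumes that `ann_file` and the `data_prefix` values are resolved relative to `data_root`. The example values are illustrative and should be replaced with those from your own config.

```python
from pathlib import Path
from typing import List


def find_missing_paths(dataset_cfg: dict) -> List[str]:
    """Return the paths referenced by a dataset config that do not exist on disk.

    Hypothetical helper: assumes ann_file and data_prefix entries are
    relative to data_root (absolute paths are used unchanged).
    """
    data_root = Path(dataset_cfg.get("data_root", ""))
    candidates = []
    if "ann_file" in dataset_cfg:
        candidates.append(data_root / dataset_cfg["ann_file"])
    # data_prefix maps a key such as 'img' to a sub-directory of data_root.
    for sub_dir in dataset_cfg.get("data_prefix", {}).values():
        candidates.append(data_root / sub_dir)
    return [str(p) for p in candidates if not p.exists()]


if __name__ == "__main__":
    # Illustrative values only; copy the real ones from your config file.
    cfg = dict(
        data_root="data/objects365v1",
        ann_file="annotations/objects365_train.json",
        data_prefix=dict(img="train/"),
    )
    for path in find_missing_paths(cfg):
        print(f"missing: {path}")
```

An empty result means every referenced path was found; anything printed points to a value worth re-checking before training starts.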

We provide the annotations of the pre-training data in the table below:

| Data | Images | Annotation File |
| --- | --- | --- |
| Objects365v1 | Objects365 train | objects365_train.json |
| MixedGrounding | GQA | final_mixed_train_no_coco.json |
| Flickr30k | Flickr30k | final_flickr_separateGT_train.json |
| LVIS-minival | COCO val2017 | lvis_v1_minival_inserted_image_name.json |

Acknowledgement: We sincerely thank GLIP and MDETR for providing the annotation files for pre-training.

Dataset Class

For fine-tuning YOLO-World on closed-set object detection, we recommend using MultiModalDataset.

Setting CLASSES/Categories

If you use a custom dataset in COCO format, you do not need to define a dataset class for custom vocabularies/categories. Simply set CLASSES explicitly in the config file through metainfo=dict(classes=your_classes):

coco_train_dataset = dict(
    _delete_=True,
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5CocoDataset',
        metainfo=dict(classes=your_classes),
        data_root='data/your_data',
        ann_file='annotations/your_annotation.json',
        data_prefix=dict(img='images/'),
        filter_cfg=dict(filter_empty_gt=False, min_size=32)),
    class_text_path='data/texts/your_class_texts.json',
    pipeline=train_pipeline)
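
The config above leaves `your_classes` undefined. One way to obtain it is to read the category names from the COCO-format annotation file itself. This is a minimal sketch: `classes_from_coco` is a hypothetical helper, and it keeps the categories in the order they appear in the file. Whether that order matches the label order your dataset class expects is an assumption worth verifying.

```python
import json


def classes_from_coco(ann_path: str) -> tuple:
    """Return category names from a COCO-format annotation file.

    Hypothetical helper: names are kept in the order in which the
    'categories' entries appear in the file.
    """
    with open(ann_path, "r", encoding="utf-8") as f:
        coco = json.load(f)
    return tuple(category["name"] for category in coco["categories"])


# Usage (the path follows the placeholder names used in the config above):
# your_classes = classes_from_coco("data/your_data/annotations/your_annotation.json")
```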

For training YOLO-World, we mainly adopt two kinds of dataset classes:

1. MultiModalDataset

MultiModalDataset is a simple wrapper for a pre-defined dataset class, such as Objects365 or COCO, which adds the texts (category texts) to the dataset instance for formatting input texts.

Text JSON

The JSON file is formatted as follows:

[
    ["A_1", "A_2"],
    ["B"],
    ["C_1", "C_2", "C_3"],
    ...
]

We provide text JSON files for LVIS, COCO, and Objects365.
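
For a custom vocabulary, a file in this format can be generated with a few lines of Python. The following is a minimal sketch: `build_class_texts` and `write_class_texts` are hypothetical helpers, not part of YOLO-World. Each category becomes one inner list, holding either a single name or several alternative texts.

```python
import json


def build_class_texts(class_texts):
    """Turn a list of names (or lists of alternative names) into the nested format."""
    # A plain string becomes a one-element list; sequences are copied as lists.
    return [[t] if isinstance(t, str) else list(t) for t in class_texts]


def write_class_texts(class_texts, out_path):
    """Write the nested list to out_path as JSON, one inner list per category."""
    rows = build_class_texts(class_texts)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    return rows


if __name__ == "__main__":
    # Illustrative categories; the second one has two alternative texts.
    print(json.dumps(build_class_texts(["person", ["bicycle", "bike"]])))
```

The output path would then be passed as `class_text_path` in the dataset config. The inner lists should follow the same order as the classes in `metainfo`.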

2. YOLOv5MixedGroundingDataset

The YOLOv5MixedGroundingDataset extends the COCO dataset by supporting loading texts/captions from the JSON file. It is designed for MixedGrounding or Flickr30K, with text tokens for each object.

🔥 Custom Datasets

For custom datasets, we suggest that users convert the annotation files according to the intended usage. Note that the annotations generally need to be converted to the standard COCO format.

  1. Large vocabulary, grounding, referring: you can follow the annotation format of the MixedGrounding dataset, which adds caption and tokens_positive to assign the text for each object. The texts can be a category or a noun phrase.

  2. Custom vocabulary (fixed): you can adopt the MultiModalDataset wrapper, as for Objects365, and create a text JSON for your custom categories.
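
To illustrate the first option, the sketch below shows how caption and tokens_positive can tie a phrase to an object. It rests on assumptions not stated here: that tokens_positive holds [start, end) character spans into the caption (the convention of MDETR-style annotations), that the caption lives on the image entry, and that tokens_positive lives on the annotation entry. The helpers `char_span` and `phrase_for_object`, the file name, and the box values are all hypothetical.

```python
from typing import List, Sequence


def char_span(caption: str, phrase: str) -> List[int]:
    """Return the [start, end) character span of the first occurrence of phrase."""
    start = caption.index(phrase)  # raises ValueError if the phrase is absent
    return [start, start + len(phrase)]


def phrase_for_object(caption: str, tokens_positive: Sequence[Sequence[int]]) -> str:
    """Join the caption substrings selected by the character spans."""
    return " ".join(caption[start:end] for start, end in tokens_positive)


caption = "a man riding a red bicycle"

# Assumed layout: the caption is stored with the image entry ...
image_entry = {"id": 1, "file_name": "example.jpg", "caption": caption}

# ... and each object stores the spans of the text that describes it.
annotation_entry = {
    "image_id": 1,
    "bbox": [120.0, 80.0, 200.0, 150.0],  # COCO-style [x, y, width, height]
    "tokens_positive": [char_span(caption, "red bicycle")],
}

# Recover the text assigned to the object from its spans.
print(phrase_for_object(image_entry["caption"], annotation_entry["tokens_positive"]))
```

Storing spans rather than the phrase itself keeps each object's text tied to its caption, so one caption can describe several objects without duplicating strings.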

CC3M Pseudo Annotations

The following annotations are generated according to the automatic labeling process in our paper, and we report the results based on these annotations.

To use CC3M annotations, you need to prepare the CC3M images first.

| Data | Images | Boxes | File |
| --- | --- | --- | --- |
| CC3M-246K | 246,363 | 820,629 | Download 🤗 |
| CC3M-750K | 750,000 | 4,504,805 | Download 🤗 |
| CC3M-500K | 536,405 | 1,784,405 | Download 🤗 |
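
Since the CC3M images have to be prepared separately, it can help to confirm that every image referenced by an annotation file is present locally. The following is a minimal sketch under an unconfirmed assumption: that the annotation file is COCO-style, with an `images` list whose entries carry a `file_name`. `missing_images` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import List


def missing_images(ann_path: str, image_dir: str) -> List[str]:
    """List file names referenced in the annotation file but absent from image_dir.

    Assumes a COCO-style layout: {"images": [{"file_name": ...}, ...], ...}.
    """
    with open(ann_path, "r", encoding="utf-8") as f:
        annotations = json.load(f)
    root = Path(image_dir)
    return [
        image["file_name"]
        for image in annotations["images"]
        if not (root / image["file_name"]).is_file()
    ]
```

An empty list means all referenced images were found. A long list usually indicates an incomplete download or a mismatched image directory.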