SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing #131
brunosan started this conversation in Related work
This paper describes how to use OSM data and the CLIP learning mechanism to create a remote sensing vision-language model.

Being able to relate text descriptions to images is extremely useful. CLIP is such a tool, trained on images and their descriptions gathered from around the web. If you have such pairs, the approach (like in #57) is to learn by pulling the text and image embeddings towards their matching pairs and away from all the others. For remote sensing images (let's call them rasters) there are no such image-description pairs at scale.
The key insight from @wangzhecheng et al. here is to pull OSM tags as the descriptions, and use CLIP embeddings both to help curate OSM descriptions and as pre-trained starting weights.
CLIP is the learning mechanism. You need two learnable architectures creating the embeddings, one for rasters and one for text.
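For reference, here is a minimal sketch of that contrastive mechanism (my own illustration, not the paper's code; the two encoders are stand-ins for whatever image and text towers you use):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss: matched raster/text pairs
    sit on the diagonal of the similarity matrix; every other pair in
    the batch acts as a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)     # caption -> its image
    return (loss_i2t + loss_t2i) / 2

# Toy usage: pretend each tower already mapped a batch of 8 rasters /
# captions into a shared 512-d embedding space.
image_emb = torch.randn(8, 512)   # stand-in for image_encoder(rasters)
text_emb = torch.randn(8, 512)    # stand-in for text_encoder(captions)
print(clip_contrastive_loss(image_emb, text_emb))
```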
In both cases they start with OpenAI's pre-trained CLIP architectures (trained on internet images): a ViT for images and a transformer for text. The idea is to lean on all the basic concepts already learned. When training, the more they freeze these pre-trained models, the worse it performs, which seems to highlight the need for domain-specific datasets and for training both sides.
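Concretely, "start from OpenAI CLIP and keep both towers trainable" looks roughly like this with the `open_clip` package (attribute names may differ across versions; the freeze/unfreeze helper is my own sketch, not theirs):

```python
import open_clip

# Start from OpenAI's released weights: a ViT image tower and a
# transformer text tower sharing one embedding space.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def set_trainable(model, train_image_tower=True, train_text_tower=True):
    """Crude freeze/unfreeze switch: everything under `visual.` is the
    image tower; the rest (text transformer, token embeddings, logit
    scale) is grouped with the text side for this sketch."""
    for name, p in model.named_parameters():
        is_visual = name.startswith("visual.")
        p.requires_grad = train_image_tower if is_visual else train_text_tower

# The ablation in the paper points towards keeping both towers trainable.
set_trainable(model, train_image_tower=True, train_text_tower=True)
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_trainable:,} trainable parameters")
```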
For the rasters themselves they use a combination of datasets, from Sentinel to NAIP or Planet, to span resolutions from 1 cm to 30 m.
Text description preparation.

- OSM data comes as `key:value` tags.
- For the `key:value` tags they consider relevant, they add the locations of all occurrences to the selection.
- For each `key:value`, they run it through a (normal) OpenAI CLIP classifier to predict at what resolution it can be seen (e.g. a house won't be visible at 30 m/pixel). This creates a map of possible image sources to make raster-text pairs.
- Each selected tag is turned into a short text description (e.g. `highway IS paved`).
- If the `key:value` tag is a point, they create a bounding box with a span appropriate for the resolution. They add some noise so the object is not always exactly in the middle (see the sketch after this list).
- If the `key:value` tag is a way or a polygon, this defines the extent of the image. This extent should make sense for the raster resolution; they drop cases where the extent and resolution differ too much.
- Each raster gets two sets of descriptions: the `key:value` that triggered its inclusion in the dataset, and another set pulling all other `key:value` tags present in the bounding box of that raster.
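A rough sketch of what the caption-building and point-to-bounding-box steps above could look like (my own illustration; the jitter and sizing heuristics are made up for the example, not the authors' values):

```python
import random

def tag_to_caption(key, value):
    """Turn one OSM key:value pair into the paper-style phrase,
    e.g. ('highway', 'paved') -> 'highway IS paved'."""
    return f"{key} IS {value}"

def bbox_for_point(lon, lat, gsd_m, image_px=256, jitter_frac=0.3):
    """Bounding box around a point tag, sized to the raster's ground
    sample distance (gsd_m, metres/pixel), with random jitter so the
    object does not always sit dead-centre."""
    half_m = gsd_m * image_px / 2        # half of the footprint, in metres
    deg_per_m = 1.0 / 111_320            # crude metres -> degrees near the equator
    half_deg = half_m * deg_per_m
    dx = random.uniform(-jitter_frac, jitter_frac) * half_deg
    dy = random.uniform(-jitter_frac, jitter_frac) * half_deg
    cx, cy = lon + dx, lat + dy
    return (cx - half_deg, cy - half_deg, cx + half_deg, cy + half_deg)

print(tag_to_caption("highway", "paved"))                 # 'highway IS paved'
print(bbox_for_point(lon=2.15, lat=41.39, gsd_m=0.6))     # ~0.6 m/px imagery
```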
Semantic coverage
The final dataset seems to have great global coverage (with dense sampling where high-resolution imagery AND OSM tags are dense). There are also 580 tags with ≥1,000 images, and more than 1,800 tags with ≥100 images included in the dataset.
While it is surprising that 100 images are enough, I am guessing that using the OpenAI CLIP pre-trained embeddings helps carry over the semantic relations of these words.
Relevance to Clay
Using OSM was one of the ideas @yellowcap and @srmsoumya had already considered, and this paper demonstrates the detail and decisions needed to actually execute the idea.
They mention that LAION-2B does include 726K descriptions of rasters. (LAION's legal status is being disputed, since it includes PII and has been subject to GDPR claims.)
It was really interesting to see that classification of broad classes was more difficult than classification of narrow classes (e.g. detecting roads or roofs was harder than classifying the type of road or roof), especially when this was done using rasters from other sources (e.g. Bing search).
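For reference, the kind of zero-shot classification being compared there looks roughly like this (illustrative prompts and file name, not the paper's evaluation code):

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# "Narrow" classes: the object category is fixed, only the attribute varies.
prompts = [
    "a satellite photo of a paved road",
    "a satellite photo of an unpaved road",
]

image = preprocess(Image.open("tile.png")).unsqueeze(0)   # hypothetical raster chip
with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(prompts))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.t()).softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze(0).tolist())))
```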
While readily available, we cannot use their model or datasets, since they were made with Google Earth Engine, Planet, and potentially other elements with license restrictions.
Their leverage of OpenAI CLIP makes me wonder if we could use an LLM to transform `key:value` machine dictionaries into descriptive text. This would leverage more of the internal relationships of the pre-trained text embeddings while keeping the relevant `key:value` tokens (a rough sketch below).
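A sketch of the kind of prompt I have in mind (the tag dictionary is an example and the LLM call is left as a placeholder, not a tested pipeline):

```python
# Hypothetical sketch: turn the raw OSM tag dictionary of one raster chip
# into a natural-language caption with an LLM, while keeping the key:value
# tokens verbatim so they still anchor the text embedding.
tags = {"building": "yes", "building:levels": "2", "roof:material": "metal"}

prompt = (
    "Rewrite these OpenStreetMap tags as one short sentence describing what "
    "is visible in an aerial image. Keep each key=value pair verbatim in "
    "parentheses after the phrase it supports.\n\n"
    + "\n".join(f"{k}={v}" for k, v in tags.items())
)

# caption = some_llm(prompt)   # placeholder: any chat/completion API would do
# Expected shape of the output (illustrative only):
# "A two-storey building (building=yes, building:levels=2) with a metal roof
#  (roof:material=metal)."
print(prompt)
```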
cc @danhammer