SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing #131
brunosan started this conversation in Related work
This paper describes how to use OSM data and the CLIP learning mechanism to create a remote sensing vision-language model.

Being able to relate text descriptions to images is extremely useful. CLIP is such a tool, trained on images and their descriptions gathered from around the web. If you have such pairs, the approach (like in #57) is to learn by pulling the text and image embeddings towards their matching pairs and away from all the others. For remote sensing images (let's call them rasters) there are no such image-description pairs at scale.
The key insight from @wangzhecheng et al. here is to pull OSM tags as the descriptions, and use CLIP embeddings both to help curate OSM descriptions and as pre-trained starting weights.
CLIP is the learning mechanism. You need two learnable architectures creating the embeddings, one for rasters and one for text.
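For reference, here is a minimal sketch of that contrastive mechanism (my own illustration, not the paper's code; the two encoders are stand-ins for whatever image and text towers you use):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss: matched raster/text pairs
    sit on the diagonal of the similarity matrix; every other pair in
    the batch acts as a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)     # caption -> its image
    return (loss_i2t + loss_t2i) / 2

# Toy usage: pretend each tower already mapped a batch of 8 rasters /
# captions into a shared 512-d embedding space.
image_emb = torch.randn(8, 512)   # stand-in for image_encoder(rasters)
text_emb = torch.randn(8, 512)    # stand-in for text_encoder(captions)
print(clip_contrastive_loss(image_emb, text_emb))
```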
In both cases they start with OpenAI's pre-trained CLIP architectures (trained on internet images): a ViT for images and a transformer for text. The idea is to lean on all the basic concepts already learned. When training, the more they freeze these pre-trained models, the worse it performs, which seems to highlight the need for domain-specific datasets and for training both sides.
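Concretely, "start from OpenAI CLIP and keep both towers trainable" looks roughly like this with the `open_clip` package (attribute names may differ across versions; the freeze/unfreeze helper is my own sketch, not theirs):

```python
import open_clip

# Start from OpenAI's released weights: a ViT image tower and a
# transformer text tower sharing one embedding space.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def set_trainable(model, train_image_tower=True, train_text_tower=True):
    """Crude freeze/unfreeze switch: everything under `visual.` is the
    image tower; the rest (text transformer, token embeddings, logit
    scale) is grouped with the text side for this sketch."""
    for name, p in model.named_parameters():
        is_visual = name.startswith("visual.")
        p.requires_grad = train_image_tower if is_visual else train_text_tower

# The ablation in the paper points towards keeping both towers trainable.
set_trainable(model, train_image_tower=True, train_text_tower=True)
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_trainable:,} trainable parameters")
```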
For the rasters themselves they use a combination of datasets, from Sentinel to NAIP or Planet, to span resolutions from 1 cm to 30 m.
Text description preparation.

- OSM data comes as `key:value` tags.
- For the `key:value` tags they consider relevant, they add the locations of all occurrences to the selection.
- For each `key:value`, they run it through a (normal) OpenAI CLIP classifier to predict at what resolution it can be seen (e.g. a house won't be visible at 30 m/pixel). This creates a map of possible image sources to make raster-text pairs.
- Each selected tag is turned into a short text description (e.g. `highway IS paved`).
- If the `key:value` tag is a point, they create a bounding box with a span appropriate for the resolution. They add some noise so the object is not always exactly in the middle (see the sketch after this list).
- If the `key:value` tag is a way or a polygon, this defines the extent of the image. This extent should make sense for the raster resolution; they drop cases where the extent and resolution differ too much.
- Each raster gets two sets of descriptions: the `key:value` that triggered its inclusion in the dataset, and another set pulling all other `key:value` tags present in the bounding box of that raster.
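A rough sketch of what the caption-building and point-to-bounding-box steps above could look like (my own illustration; the jitter and sizing heuristics are made up for the example, not the authors' values):

```python
import random

def tag_to_caption(key, value):
    """Turn one OSM key:value pair into the paper-style phrase,
    e.g. ('highway', 'paved') -> 'highway IS paved'."""
    return f"{key} IS {value}"

def bbox_for_point(lon, lat, gsd_m, image_px=256, jitter_frac=0.3):
    """Bounding box around a point tag, sized to the raster's ground
    sample distance (gsd_m, metres/pixel), with random jitter so the
    object does not always sit dead-centre."""
    half_m = gsd_m * image_px / 2        # half of the footprint, in metres
    deg_per_m = 1.0 / 111_320            # crude metres -> degrees near the equator
    half_deg = half_m * deg_per_m
    dx = random.uniform(-jitter_frac, jitter_frac) * half_deg
    dy = random.uniform(-jitter_frac, jitter_frac) * half_deg
    cx, cy = lon + dx, lat + dy
    return (cx - half_deg, cy - half_deg, cx + half_deg, cy + half_deg)

print(tag_to_caption("highway", "paved"))                 # 'highway IS paved'
print(bbox_for_point(lon=2.15, lat=41.39, gsd_m=0.6))     # ~0.6 m/px imagery
```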
Semantic coverage
The final dataset seems to have great global coverage (with dense sampling where high-resolution imagery AND OSM tags are dense). There are also 580 tags with ≥1,000 images, and more than 1,800 tags with ≥100 images included in the dataset.
While it is surprising that 100 images are enough, I am guessing that using the OpenAI CLIP pre-trained embeddings helps carry over the semantic relations of these words.
Relevance to Clay
Using OSM was one of the ideas @yellowcap and @srmsoumya had already considered, and this paper demonstrates the detail and decisions needed to actually execute the idea.
They mention that LAION-2B does include 726K descriptions of rasters. (LAION's legal status is being disputed, since it includes PII and has been subject to GDPR claims.)
It was really interesting to see that classification of broad classes was more difficult than classification of narrow classes (e.g. detecting roads or roofs was harder than classifying the type of road or roof), especially when this was done using rasters from other sources (e.g. Bing search).
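For reference, the kind of zero-shot classification being compared there looks roughly like this (illustrative prompts and file name, not the paper's evaluation code):

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# "Narrow" classes: the object category is fixed, only the attribute varies.
prompts = [
    "a satellite photo of a paved road",
    "a satellite photo of an unpaved road",
]

image = preprocess(Image.open("tile.png")).unsqueeze(0)   # hypothetical raster chip
with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(prompts))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.t()).softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze(0).tolist())))
```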
While readily available, we cannot use their model or datasets, since they were made with Google Earth Engine, Planet, and potentially other elements with license restrictions.
Their leverage of OpenAI CLIP makes me wonder if we could use an LLM to transform `key:value` machine dictionaries into descriptive text. This would leverage more of the internal relationships of the pre-trained text embeddings while keeping the relevant `key:value` tokens (a rough sketch below).
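A sketch of the kind of prompt I have in mind (the tag dictionary is an example and the LLM call is left as a placeholder, not a tested pipeline):

```python
# Hypothetical sketch: turn the raw OSM tag dictionary of one raster chip
# into a natural-language caption with an LLM, while keeping the key:value
# tokens verbatim so they still anchor the text embedding.
tags = {"building": "yes", "building:levels": "2", "roof:material": "metal"}

prompt = (
    "Rewrite these OpenStreetMap tags as one short sentence describing what "
    "is visible in an aerial image. Keep each key=value pair verbatim in "
    "parentheses after the phrase it supports.\n\n"
    + "\n".join(f"{k}={v}" for k, v in tags.items())
)

# caption = some_llm(prompt)   # placeholder: any chat/completion API would do
# Expected shape of the output (illustrative only):
# "A two-storey building (building=yes, building:levels=2) with a metal roof
#  (roof:material=metal)."
print(prompt)
```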
cc @danhammer