Best practices for loading static STAC catalogs #86

scottyhq · 2021-11-12T18:37:01Z


As illustrated in the examples, stackstac works amazingly well for loading up the results of a pystac_client search in milliseconds! This is partly because the search results are stored as a single JSON FeatureCollection so it's just one blob of metadata to read. However, with published static STAC catalogs all the items must be discovered by crawling links, which ends up being slow.

Static STACs are nice because they are easy to create and can be consumed by tools like STAC browser whereas STAC APIs seem quite challenging to set up. So my question is, what's the best way for stackstac to open static catalogs? Perhaps a new tool or function in pystac (pystac.stac_io.AsyncStacIO()?) or pystac_client that converts a static catalog to a single FeatureCollection very efficiently? Does such a thing already exist?

Curious for ideas from @TomAugspurger @sharkinsspatial @matthewhanson @duckontheweb - Note there are Zarr parallels here to a static catalog having distributed metadata but the api results having "consolidated metadata", as discussed here #50 (comment).

Below is an example of current static catalog performance using https://github.com/relativeorbit/aws-rtc-12SYJ

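%%time

import pystac  # 1.1
import stackstac  # 0.2.1

# Root of a static catalog hosted as plain JSON files
root = 'https://raw.githubusercontent.com/relativeorbit/aws-rtc-12SYJ/main/catalog.json'
cat = pystac.read_file(root)  # read_file falls back to pystac's default StacIO

# Items are discovered by following the catalog's links, one HTTP request each
stack = stackstac.stack(cat)
# CPU times: user 4.68 s, sys: 761 ms, total: 5.44 s
# Wall time: 13.9 s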

TomAugspurger · 2021-11-12T18:41:56Z


> pystac_client that converts a static catalog to a single FeatureCollection very efficiently?

I think that would be my recommendation: A way to quickly crawl a static STAC catalog, following links up to some depth / some limit of items, and collecting the results in an ItemCollection. Then stackstac doesn't need to change at all.

Edit: That said, you're still looking at one HTTP request per (sub)-catalog plus one HTTP request per item, which will add up even if you're making a bunch of requests concurrently. Would it be reasonable to do that "server-side", and collect the items into a static FeatureCollection, load that with pystac, and pass that to stackstac?
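As a rough illustration of the crawl described above, here is a minimal standard-library sketch. The names `fetch_json` and `crawl_items` and the parameters `max_depth` and `max_items` are hypothetical, not part of pystac or pystac_client. It follows `child` and `item` links breadth-first and collects the items into a plain FeatureCollection dict.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(href):
    """Fetch and parse one STAC JSON document over HTTP(S)."""
    with urlopen(href) as resp:
        return json.load(resp)


def crawl_items(root_href, fetch=fetch_json, max_depth=3, max_items=1000):
    """Follow 'child' and 'item' links from root_href.

    Hypothetical helper: returns a FeatureCollection dict holding at most
    max_items items found at most max_depth levels below the root.
    """
    features = []
    queue = [(root_href, 0)]  # (catalog href, depth below the root)
    while queue and len(features) < max_items:
        href, depth = queue.pop(0)
        catalog = fetch(href)
        for link in catalog.get("links", []):
            target = urljoin(href, link["href"])  # resolve relative hrefs
            if link["rel"] == "item" and len(features) < max_items:
                features.append(fetch(target))  # one request per item
            elif link["rel"] == "child" and depth < max_depth:
                queue.append((target, depth + 1))
    return {"type": "FeatureCollection", "features": features}
```

The resulting dict could be written once to a static JSON file next to the catalog, or, assuming pystac 1.x, loaded with `pystac.ItemCollection.from_dict(...)` and handed to `stackstac.stack`. This version is sequential, so it still pays for every request in turn.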


gjoseph92 Nov 12, 2021
Maintainer

> Then stackstac doesn't need to change at all.

Agreed. I think this would be a great tool to have in the STAC ecosystem generally, but doesn't really belong within the scope of stackstac. Once it exists though, stackstac could even use it automatically if we wanted.

It also might be reasonable for pystac to be this tool. Being able to crawl in parallel using async requests feels like a reasonable feature (stac-utils/pystac#274).

> Would it be reasonable to do that "server-side", and collect the items into a static FeatureCollection, load that with pystac

It would be pretty cool, but this also sounds like an extension to the STAC API, right @matthewhanson? The quickest way right now might actually be to use search APIs on catalogs that support it, since those can send data in bulk and support paging.

Perhaps this tool could look at the root catalog, and if there's a link to a search API, switch to using that (with no query parameters) to basically bulk-download the catalog? And if there's no search API, then just use the slow path of crawling the catalog through a bunch of asyncio requests.
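A small sketch of that decision, assuming the root document advertises its search endpoint as a link with `rel` set to `search` (the convention STAC API landing pages use). The function names `find_search_href` and `choose_strategy` are made up for illustration.

```python
from urllib.parse import urljoin


def find_search_href(root_href, root_doc):
    """Return the absolute href of the root's first 'search' link, or None."""
    for link in root_doc.get("links", []):
        if link.get("rel") == "search":
            return urljoin(root_href, link["href"])
    return None


def choose_strategy(root_href, root_doc):
    """Prefer bulk download through search; otherwise fall back to crawling."""
    search_href = find_search_href(root_href, root_doc)
    if search_href is not None:
        return ("search", search_href)
    return ("crawl", root_href)
```

A static catalog on S3 or GitHub would have no such link, so it would always take the `"crawl"` branch.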

Since @scottyhq is specifically talking about static catalogs that don't have a search API though, I'd guess there's not much of a "server" on the backend to do this "server-side". I'm mostly imagining a bunch of JSONs stored in S3. There's really not much you can do to make that faster besides concurrent requests (which would make it a lot faster!).
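To make the concurrent-requests idea concrete, here is a hedged asyncio sketch. `crawl_concurrently` and `max_concurrency` are hypothetical names, and `fetch` is assumed to be an async callable (built on whatever async HTTP client you prefer) that takes an href and returns the parsed JSON. Nothing here is an existing pystac API.

```python
import asyncio
from urllib.parse import urljoin


async def crawl_concurrently(root_href, fetch, max_concurrency=32):
    """Crawl 'child' and 'item' links using concurrent requests.

    `fetch` is an assumed async callable: href -> parsed JSON dict.
    """
    semaphore = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def limited_fetch(href):
        async with semaphore:
            return await fetch(href)

    async def visit(catalog_href):
        catalog = await limited_fetch(catalog_href)
        item_hrefs, child_hrefs = [], []
        for link in catalog.get("links", []):
            target = urljoin(catalog_href, link["href"])
            if link["rel"] == "item":
                item_hrefs.append(target)
            elif link["rel"] == "child":
                child_hrefs.append(target)
        # Fetch this catalog's items and descend into sub-catalogs together.
        items, nested = await asyncio.gather(
            asyncio.gather(*(limited_fetch(h) for h in item_hrefs)),
            asyncio.gather(*(visit(h) for h in child_hrefs)),
        )
        return list(items) + [item for sub in nested for item in sub]

    features = await visit(root_href)
    return {"type": "FeatureCollection", "features": features}
```

In a script this would be driven with `asyncio.run(crawl_concurrently(root, fetch))`; in a notebook, where an event loop is already running, `await crawl_concurrently(root, fetch)` is the usual form. The request count is unchanged, only the waiting overlaps.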

matthewhanson · 2021-11-15T20:07:15Z


I've thought about this idea a bit and agree, it's either PySTAC or pystac-client. I'd like some way to be able to go both ways really easily: either crawl a catalog and convert to an ItemCollection, or read in an ItemCollection and explode that into a catalog.
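The "explode" direction might look roughly like the sketch below. It is an assumption-laden illustration, not an existing PySTAC or pystac-client function: `explode_to_catalog`, the flat `<id>/<id>.json` layout, and the default `catalog_id` and `description` values are all hypothetical choices. It maps relative file paths to JSON documents, leaving the writing to disk or object storage to the caller.

```python
import copy


def explode_to_catalog(feature_collection, catalog_id="exploded",
                       description="Items exploded from a FeatureCollection"):
    """Return {relative path: JSON dict} for a flat static catalog (hypothetical helper)."""
    catalog = {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": catalog_id,
        "description": description,
        "links": [{"rel": "root", "href": "./catalog.json", "type": "application/json"}],
    }
    files = {"catalog.json": catalog}
    for feature in feature_collection["features"]:
        item = copy.deepcopy(feature)  # leave the input untouched
        path = f"{item['id']}/{item['id']}.json"
        # Drop links that described the item's old location, keep everything else.
        kept = [link for link in item.get("links", [])
                if link["rel"] not in {"root", "parent", "self"}]
        item["links"] = kept + [
            {"rel": "root", "href": "../catalog.json", "type": "application/json"},
            {"rel": "parent", "href": "../catalog.json", "type": "application/json"},
        ]
        catalog["links"].append(
            {"rel": "item", "href": f"./{path}", "type": "application/geo+json"}
        )
        files[path] = item
    return files
```

Each value can then be dumped with `json.dump` to its path under a chosen output directory.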

Certainly an API is ideal, but as @scottyhq points out, this can be useful for some data where there is a smaller number of STAC Items and not a big need to search (such as with modeled climate data stored in Zarr). There is certainly a limitation here in that it could be slow, as 1 item = 1 request, but async requests are on the roadmap for pystac-client, which would help with that.

A provider could also prepackage multiple Items in a single FeatureCollection if it were a set of Items that would be used together, but this wouldn't adhere to the static STAC spec. We actually talked about this at one point but there were too many problems.
