How to access latest data and understand versions #88

Open
tomalrussell opened this issue Jan 12, 2024 · 4 comments

@tomalrussell

First, thanks and much appreciation for making this both open and accessible, via the dataset-links.csv linked from this repository and on the Planetary Computer data catalog!

I can happily access either source, but I'm not clear which data releases are available in each location, or how best to pull updates.

  • https://minedbuildings.blob.core.windows.net/global-buildings/dataset-links.csv lists all URLs with 2023-12-26 in the path. Is 2023-12-26 the release date for all these files?

  • the README.md here lists an update for 2024-01-03, particularly for buildings in Brazil and Italy. Is this update included in the files linked from dataset-links.csv?

  • the Planetary Computer example notebook shows how to access the data as a Delta table, which lists URIs under 2023-04-25/ml-buildings.parquet, and table.history() gives a single WRITE operation at timestamp 1682774982678, around 2023-04-29. Are any of the more recent updates listed in this repository present in that parquet dataset, or are there plans to push updates there?

Should I be aware of tools to help with bulk access or reading metadata for either location?
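
For context, this is roughly how I'm reading the table's version history with the deltalake package; the table URI and SAS token below are placeholders rather than the exact values, and the example notebook shows how to obtain the signed ones:

```python
# Sketch only: the table URI and SAS token are placeholders; the Planetary
# Computer example notebook shows how to obtain the signed values.
from deltalake import DeltaTable

table_uri = "abfs://<container>/2023-04-25/ml-buildings.parquet"  # placeholder
storage_options = {"sas_token": "<sas-token>"}  # placeholder

table = DeltaTable(table_uri, storage_options=storage_options)
print("current version:", table.version())
for commit in table.history():
    # one dict per commit, with a millisecond timestamp and the operation type
    print(commit["timestamp"], commit["operation"])
```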

edkry commented Mar 13, 2024

I am experiencing the same issue

andwoi (Contributor) commented Mar 27, 2024

The latest version of the data is always released here, through an updated dataset-links.csv. There can be delays in other sources like the Planetary Computer, where there is a hand-off. We use the dataset-links.csv and JSON format to try to maintain backward compatibility: older releases were smaller and could be shared via country name (or state name, in the case of the US release). There is certainly a case to be made to 1) open the storage account for bulk transfer and 2) use a more compressed format like geoparquet (which could cause other user headaches).
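
For what it's worth, a bulk pull against the current release looks roughly like this; it assumes the current dataset-links.csv layout, with Location and Url columns pointing at gzipped, newline-delimited GeoJSON files:

```python
# Rough sketch: pull the latest release for one country via dataset-links.csv.
# Assumes the current layout: "Location" and "Url" columns, with each linked
# file holding gzipped, newline-delimited GeoJSON features.
import pandas as pd
import geopandas as gpd
from shapely.geometry import shape

links = pd.read_csv(
    "https://minedbuildings.blob.core.windows.net/global-buildings/dataset-links.csv"
)
urls = links.loc[links["Location"] == "Brazil", "Url"]

frames = []
for url in urls.head(2):  # a couple of files, just to illustrate
    df = pd.read_json(url, lines=True, compression="gzip")
    df["geometry"] = df["geometry"].apply(shape)  # GeoJSON dict -> shapely geometry
    frames.append(gpd.GeoDataFrame(df, geometry="geometry", crs=4326))

buildings = pd.concat(frames, ignore_index=True)
print(len(buildings), "building footprints")
```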

The README updates and the dataset-links.csv dates do not match: data deliveries are automated, but the README is updated manually. We can update the documentation process to ensure the dates align.

@johnphilippowell

+1 for geoparquet; especially with tools like DuckDB it is super efficient. Good to know that dataset-links.csv is always up to date, as that process is very easy to use, even if slower than parquet.
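
Even the current CSV is handy to query directly with DuckDB over HTTP, for example (the Location column name here is assumed from the current file layout):

```python
# Quick release summary with DuckDB, querying dataset-links.csv straight over HTTP.
# The "Location" column name is assumed from the current CSV layout.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

print(con.sql("""
    SELECT Location, count(*) AS n_files
    FROM read_csv_auto('https://minedbuildings.blob.core.windows.net/global-buildings/dataset-links.csv')
    GROUP BY Location
    ORDER BY n_files DESC
    LIMIT 10
"""))
```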

@tomalrussell

Thanks @andwoi, that helps - I think having the dates in the folder structure align with the README note would have cleared up at least part of my confusion.

Am I right to understand that the whole dataset is updated or re-provided through your delivery pipeline each release? Would it be possible/interesting to include a "last-updated" column and/or an "imagery dates" column as additional metadata in dataset-links.csv?
