How to access latest data and understand versions #88
Comments
I am experiencing the same issue.
The latest version of the data is always released here through an updated dataset-links.csv. Other sources, such as Planetary Computer, involve a hand-off and may lag behind. We keep dataset-links.csv and the json format to try to maintain backward compatibility: older releases were smaller and could be shared by country name, or by state name in the case of the US release. There is certainly a case to be made to 1) open the storage account for bulk transfer and 2) use a more compressed format like geoparquet (which could cause other user headaches). The readme update dates and the dataset-links.csv dates do not match because data deliveries are automated while the readme is updated manually. We can update the documentation process to ensure the dates align.
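As a rough sketch of how to check which release dates a copy of dataset-links.csv actually points to, the snippet below counts the `YYYY-MM-DD` path segments found in each URL. The column name `Url` and the sample rows are assumptions made for illustration, not taken from the real file; adjust them to match the actual header.

```python
import csv
import io
import re
from collections import Counter

# Matches a date-like path segment such as /2023-12-26/ inside a URL.
DATE_SEGMENT = re.compile(r"/(\d{4}-\d{2}-\d{2})/")


def release_dates(csv_text: str, url_column: str = "Url") -> Counter:
    """Count how many listed URLs fall under each dated path segment.

    `url_column` is an assumed column name; change it if the real
    header differs.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        match = DATE_SEGMENT.search(row[url_column])
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Hypothetical sample rows, for illustration only (not real file contents).
sample = """Location,QuadKey,Url
Abyei,122321003,https://minedbuildings.blob.core.windows.net/global-buildings/2023-12-26/example-a.csv.gz
Abyei,122321021,https://minedbuildings.blob.core.windows.net/global-buildings/2023-12-26/example-b.csv.gz
"""

dates = release_dates(sample)
print(dates)
```

For the real file, the text could be fetched first (for example with `urllib.request.urlopen(url).read().decode()`) and passed to `release_dates`. More than one key in the result would suggest the CSV mixes files from several deliveries.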
+1 for geoparquet; it is very efficient, especially with tools like duckdb. Good to know that dataset-links.csv is always up to date, as that process is very easy to use, even if slower than parquet.
Thanks @andwoi, that helps. I think having the dates in the folder structure align with the README note would have cleared up at least part of my confusion. Am I right to understand that the whole dataset is updated or re-provided through your delivery pipeline each release? Would it be possible/interesting to include a "last-updated" column and/or an "imagery dates" column as additional metadata in
First, thanks and much appreciation for making this both open and accessible, both via the dataset-links.csv linked from this repository and on the planetary computer data catalog!
I can happily access either source, but I'm not clear which data releases are available in each location, or how best to pull updates.
https://minedbuildings.blob.core.windows.net/global-buildings/dataset-links.csv lists all URLs with `2023-12-26` in the path. Is `2023-12-26` the release date for all these files?
The README.md here lists an update for `2024-01-03`, particularly for buildings in Brazil and Italy. Is this update included in files linked from `dataset-links.csv`?
The planetary computer example notebook shows how to access the data as a Delta Table, which lists URIs under `2023-04-25/ml-buildings.parquet`, and `table.history()` gives a single WRITE operation at timestamp 1682774982678, around `2023-04-29`. Are any of the more recent updates listed in this repository present in that parquet dataset, or are there plans to push updates there?
Should I be aware of tools to help with bulk access or reading metadata for either location?
I can request a signed URL with an SAS token for the delta table blob storage container from https://planetarycomputer.microsoft.com/api/sas/v1/sign?href=https://bingmlbuildings.blob.core.windows.net/footprints/delta and give that to `azcopy list`, though I don't discover any other versions or updates there.
I'm not sure how to directly access or list all files under `https://minedbuildings.blob.core.windows.net/global-buildings`, only directly accessing those listed in the CSV, or linked from the README in history here (e.g. Abyei). Are all versions (or all latest versions, with release/update metadata) intended to be accessible?
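For reference, the Delta Table history timestamp quoted above can be converted to a calendar date with a few lines of Python. This is a minimal sketch that assumes the value is milliseconds since the Unix epoch, which is consistent with the approximate date given in the question; the helper name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def delta_timestamp_to_utc(ms: int) -> datetime:
    """Convert a millisecond epoch timestamp (assumed unit) to a UTC datetime."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Timestamp of the single WRITE operation reported by table.history().
write_time = delta_timestamp_to_utc(1682774982678)
print(write_time.isoformat())
```

Integer arithmetic is used instead of dividing by 1000.0 so the millisecond part is not subject to floating-point rounding.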