Hi folks 👋 - I was curious whether we've considered embedding metadata in the Parquet file schemas. The format allows adding arbitrary key-value pairs at both the table and column levels. Here is a small Python example 👇
First we create a toy arrow table with no metadata:
import pyarrow as pa
import pyarrow.parquet as pq

# Create a toy dataset:
no_meta_table = pa.table(dict(
    characters=["Luke", "Han", "Leia", "Ben"],
    is_jedi=[False, False, False, True],
))

# By default, we have no metadata here:
assert no_meta_table.schema.metadata is None
Then we can update the schema to add metadata (note that this can also be done during table creation):
metadata = {
    "movie": "A New Hope",
    "episode": "4",
    "year": "1977",
}
meta_table = no_meta_table.replace_schema_metadata(metadata)

# We now have metadata here:
print(meta_table.schema.metadata)
# {b'movie': b'A New Hope', b'episode': b'4', b'year': b'1977'}
This is a shallow copy: it shares the underlying data with the original table, but not the schema metadata:
# We still have no metadata in the original table:
assert no_meta_table.schema.metadata is None
And we can persist this metadata in parquet files:
# Still no metadata after write -> read:
pq.write_table(no_meta_table, "no_meta.parquet")
parsed_no_meta_table = pq.read_table("no_meta.parquet")
assert parsed_no_meta_table.schema.metadata is None

# Persisted metadata after write -> read:
pq.write_table(meta_table, "meta.parquet")
parsed_meta_table = pq.read_table("meta.parquet")
print(parsed_meta_table.schema.metadata)
# {b'movie': b'A New Hope', b'episode': b'4', b'year': b'1977'}
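The same idea appears to work at the column level too, by attaching metadata to individual fields when building the schema. A minimal sketch (the field names and the "description" key are just illustrative); in my testing, field-level metadata also round-trips through Parquet because pyarrow stores the full Arrow schema in the file:

import pyarrow as pa
import pyarrow.parquet as pq

# Attach metadata to a single field rather than the whole schema:
jedi_field = pa.field("is_jedi", pa.bool_(), metadata={"description": "Force sensitivity"})
schema = pa.schema([pa.field("characters", pa.string()), jedi_field])
table = pa.table(
    dict(characters=["Luke", "Han"], is_jedi=[True, False]),
    schema=schema,
)
print(table.schema.field("is_jedi").metadata)
# {b'description': b'Force sensitivity'}

# The field-level metadata also survives a write -> read round trip:
pq.write_table(table, "field_meta.parquet")
parsed = pq.read_table("field_meta.parquet")
print(parsed.schema.field("is_jedi").metadata)
# {b'description': b'Force sensitivity'}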
My thought is that this could be a good way to sparingly embed metadata, such as what kind of table a file contains. What do you think? The biggest downside I see for now is that the feature is not well known or documented for Parquet, and I'm not sure how well it is supported across Arrow APIs.
I like the idea. I've been hacking around on a parquet-based replacement/supplement to mzML, and I think the metadata route is how I would go for storing some of the "less important" cvParams/metadata or things that are globally applied for a file.
As you mentioned, support for various parquet features can be somewhat scattered across the ecosystem. Polars, for instance, doesn't support the MAP (dictionary of key->value) column type.
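One lightweight way to check what another writer or tool preserved is to inspect just the Parquet schema with pyarrow, without loading any row data. A small sketch (the file name and "source" key are made up for the example):

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small file carrying schema-level metadata:
table = pa.table({"x": [1, 2, 3]}).replace_schema_metadata({"source": "demo"})
pq.write_table(table, "check_meta.parquet")

# read_schema only touches the file footer, so it's a cheap way to
# verify whether the key-value metadata made it through:
schema = pq.read_schema("check_meta.parquet")
print(schema.metadata.get(b"source"))
# b'demo'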