Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embedding metadata in Parquet files? #4

Open
wfondrie opened this issue Aug 25, 2023 · 1 comment
Open

Embedding metadata in Parquet files? #4

wfondrie opened this issue Aug 25, 2023 · 1 comment

Comments

@wfondrie
Copy link
Collaborator

wfondrie commented Aug 25, 2023

Hi folks 👋 - I was curious we've considered embedding metadata in the Parquet file schemas. The format does allow for adding arbitrary key value pairs at both the table and column levels. Here is a small python example 👇

First we create a toy arrow table with no metadata:

import pyarrow as pa
import pyarrow.parquet as pq

# Create a toy dataset:
no_meta_table = pa.table(dict(
    characters=["Luke", "Han", "Leia", "Ben"],
    is_jedi=[False, False, False, True],
))

# By default, we have no metadata here:
assert no_meta_table.schema.metadata is None

Then we can update the schema to add metadata (note that this can also be done during table creation):

metadata = {
    "movie": "A New Hope", 
    "episode": "4", 
    "year": "1977",
}

meta_table = no_meta_table.replace_schema_metadata(metadata)

# We now have metadata here:
print(meta_table.schema.metadata)
# {b'movie': b'A New Hope', b'episode': b'4', b'year': b'1977'}

This is a shallow copy, share data, but not the schema metadata with the other copy:

# We still have no metadata in the original table:
assert no_meta_table.schema.metadata is None

And we can persist this metadata in parquet files:

# Still no metadata after write->read:
pq.write_table(no_meta_table, "no_meta.parquet")
parsed_no_meta_table = pq.read_table("no_meta.parquet")
assert parsed_no_meta_table.schema.metadata is None

# Persisted metadata after write-read:
pq.write_table(meta_table, "meta.parquet")
parsed_meta_table = pq.read_table("meta.parquet")
print(parsed_meta_table.schema.metadata)
# {b'movie': b'A New Hope', b'episode': b'4', b'year': b'1977'}

My thought is that this may be a good way to embed metadata, such as what kind of table it is, sparingly. What do you think? The biggest downside I see for now is that the feature is not very well known/documented for Parquet and I'm not sure how well supported it is across Arrow APIs.

@lazear
Copy link
Collaborator

lazear commented Aug 25, 2023

I like the idea. I've been hacking around on a parquet-based replacement/supplement to mzML, and I think the metadata route is how I would go for storing some of the "less important" cvParams/metadata or things that are globally applied for a file.

As you mentioned, support for various parquet features can be somewhat scattered across the ecosystem. Polars, for instance, doesn't support the MAP (dictionary of key->value) column type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants