The ColumnSynthesizer should follow the sdtypes in the metadata (not the data's dtypes) #249

Open
npatki opened this issue Jun 5, 2023 · 0 comments
Labels
bug Something isn't working

npatki commented Jun 5, 2023

Environment Details

  • SDGym version: 0.6.0 (latest)

What is expected

The ColumnSynthesizer is expected to independently model each column.

  • For numerical or datetime sdtypes, it should learn a univariate Gaussian mixture model (GMM) during fit. Then, during sample, it can draw new values from that model.
  • For categorical or boolean sdtypes, it should learn the frequencies of each category. Then during sample, it can create data using those frequencies as weights.
  • For other sdtypes (such as id, pii, etc.), it can simply use the RegexGenerator or AnonymizedFaker to generate values from scratch (no learning is expected).

How does the synthesizer know each column's sdtype? It should treat the provided metadata as the source of truth.
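To make the expected behavior concrete, here is a minimal sketch of per-column modeling that dispatches on the sdtype declared in the metadata. It is not SDGym's implementation: `fit_column`, `sample_column`, and the metadata layout are hypothetical names for illustration, a single Gaussian stands in for the univariate GMM, datetime handling is omitted, and a simple counter stands in for RegexGenerator / AnonymizedFaker.

```python
import random
import statistics
from collections import Counter


def fit_column(values, sdtype):
    """Learn a per-column model chosen by the metadata sdtype, not by the values' type."""
    if sdtype == "numerical":
        # Simplification: one Gaussian instead of a univariate GMM.
        nums = [float(v) for v in values]
        return {"kind": "gaussian",
                "mean": statistics.fmean(nums),
                "std": statistics.pstdev(nums)}
    if sdtype == "datetime":
        # Would be converted to numeric timestamps first; omitted in this sketch.
        raise NotImplementedError("datetime handling is omitted in this sketch")
    if sdtype in ("categorical", "boolean"):
        # Learn category frequencies to use as sampling weights.
        counts = Counter(values)
        return {"kind": "frequency",
                "categories": list(counts),
                "weights": list(counts.values())}
    # id, pii, etc.: nothing is learned from the data.
    return {"kind": "generated"}


def sample_column(model, num_rows, rng):
    """Draw num_rows synthetic values from a fitted column model."""
    if model["kind"] == "gaussian":
        return [rng.gauss(model["mean"], model["std"]) for _ in range(num_rows)]
    if model["kind"] == "frequency":
        return rng.choices(model["categories"], weights=model["weights"], k=num_rows)
    # Placeholder generator standing in for RegexGenerator / AnonymizedFaker.
    return [f"id_{i}" for i in range(num_rows)]


# Assumed metadata layout: column name -> sdtype.
metadata = {
    "columns": {
        "zip_code": {"sdtype": "categorical"},  # integers, but declared categorical
        "age": {"sdtype": "numerical"},
        "user_id": {"sdtype": "id"},
    }
}
data = {
    "zip_code": [10001, 10001, 94105, 10001],
    "age": [34, 51, 29, 42],
    "user_id": ["U-001", "U-002", "U-003", "U-004"],
}

models = {
    column: fit_column(data[column], info["sdtype"])
    for column, info in metadata["columns"].items()
}
rng = random.Random(0)
synthetic = {column: sample_column(model, 5, rng) for column, model in models.items()}
```

Because the dispatch reads only the metadata, `zip_code` is modeled by its category frequencies even though its values are integers, so sampling can only return zip codes that appeared in the real data.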

What is actually observed

Like the UniformSynthesizer (see #248), this synthesizer lets the RDT HyperTransformer infer each column's sdtype from the data.

It should reference the metadata instead, since the metadata is the source of truth.
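A small illustration of why inferring from the data can disagree with the metadata. The `guess_sdtype_from_values` function below is a deliberately naive stand-in, not the HyperTransformer's actual detection logic, and the column names and metadata layout are made up for the example.

```python
def guess_sdtype_from_values(values):
    """Naive dtype-style guess; a stand-in, not RDT's actual detection logic."""
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


def find_sdtype_mismatches(data, metadata):
    """Return {column: (declared, guessed)} where the data-based guess differs."""
    mismatches = {}
    for column, values in data.items():
        declared = metadata[column]["sdtype"]
        guessed = guess_sdtype_from_values(values)
        if guessed != declared:
            mismatches[column] = (declared, guessed)
    return mismatches


# Hypothetical example table and metadata.
data = {
    "zip_code": [10001, 94105, 10001],
    "age": [34, 51, 29],
    "user_id": ["U-001", "U-002", "U-003"],
}
metadata = {
    "zip_code": {"sdtype": "categorical"},
    "age": {"sdtype": "numerical"},
    "user_id": {"sdtype": "id"},
}

mismatches = find_sdtype_mismatches(data, metadata)
for column, (declared, guessed) in mismatches.items():
    print(f"{column}: metadata says {declared!r}, data-based guess is {guessed!r}")
```

Here the integer-valued `zip_code` would be modeled as numerical and the string `user_id` as categorical, neither of which matches what the metadata declares. A check along these lines could also serve as a regression test once the synthesizer reads the metadata.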

@npatki npatki added the bug Something isn't working label Jun 5, 2023
@npatki npatki changed the title The IndependentSynthesizer should follow the sdtypes in the metadata (not the data's dtypes) The ColumnSynthesizer should follow the sdtypes in the metadata (not the data's dtypes) Jan 8, 2025