
Dataflows and metadataflows #214

Open
brockfanning opened this issue Mar 2, 2021 · 4 comments

Comments

@brockfanning
Contributor

In SDMX there is the concept of a "dataflow" or "metadataflow", which (as I understand it) is a way to filter the output according to some constraints. We may be able to implement something like that here.

One use case that definitely exists is our SDMX output. Many countries may be interested in using the SDMX output to submit their data to the UNSD's database. However, this is not possible if the data uses any non-global codes/dimensions. So it would be useful to have a "dataflow" which filters the output to only include global codes/dimensions.

Ideally this filtering would be applied to the data in its internal DataFrame form, so that the feature could be used regardless of whether the output is going to be SDMX, GeoJSON, etc.
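As a minimal sketch of that idea (column names and code lists here are invented for illustration, not taken from any real DSD), the filtering could operate on the DataFrame before any output format is chosen:

```python
import pandas as pd

# Assumption: the set of globally-agreed codes for a dimension is known
# ahead of time; "" stands for "no disaggregation" in this sketch.
GLOBAL_SEX_CODES = {"", "FEMALE", "MALE"}

df = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Sex": ["FEMALE", "CUSTOM_LOCAL_CODE", "MALE"],
    "Value": [48.2, 50.1, 49.0],
})

# Keep only rows whose Sex value is a globally-defined code, so the
# same filtered frame can feed SDMX, GeoJSON, or any other output.
filtered = df[df["Sex"].isin(GLOBAL_SEX_CODES)]
print(filtered["Sex"].tolist())  # ['FEMALE', 'MALE']
```

Because the filter happens on the internal DataFrame, every downstream output format gets the constrained data for free.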

@brockfanning
Contributor Author

For data, maybe the mechanism for this could be a "skip_invalid_data" setting. This would depend on something like #20 (being worked on in #190). What I'm thinking is that, when outputting the data, if this setting is true, any row which has a disaggregation/unit/series value that is not part of the data schema will be skipped.

For example, to take the case of the SDMX for global usage: The data schema would be imported from the global SDMX DSD. Then any data row that uses custom disaggregations (like sub-national REF_AREAs, etc.) will be omitted in the output.
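A rough sketch of that "skip_invalid_data" pass might look like the following (the schema shape, column names, and codes are all hypothetical; a real implementation would derive the schema from the global SDMX DSD):

```python
import pandas as pd

# Hypothetical data schema: for each dimension column, the set of
# allowed (globally-defined) codes.
schema = {
    "REF_AREA": {"1", "202", "747"},     # global reference areas only
    "SEX": {"", "F", "M"},
}

df = pd.DataFrame({
    "REF_AREA": ["1", "GB-ENG", "202"],  # "GB-ENG" is a sub-national code
    "SEX": ["F", "M", "X"],              # "X" is not in the schema
    "Value": [1.0, 2.0, 3.0],
})

# Build a mask that is True only for rows valid in every dimension.
mask = pd.Series(True, index=df.index)
for column, allowed in schema.items():
    mask &= df[column].isin(allowed)

valid = df[mask]
print(len(valid))  # 1 — rows with any out-of-schema value are skipped
```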

@LucyGwilliamAdmin
Contributor

@brockfanning is this done or partly done? It sounds familiar.

@brockfanning
Contributor Author

@LucyGwilliamAdmin Partly, I'd say.

What I describe in the example above we definitely already have - with the "constrain_data" and "constrain_metadata" parameters.

We also have the "global_content_constraints" which similarly drops rows of data that don't comply with the global content constraints (like that certain series have to be female, etc.).

A couple of things, I think, still need to be done, regarding that "global_content_constraints" parameter:

  1. Abdulla has pointed out (rightly) that it should not silently drop the rows. Instead, it should fail and abort the build. Countries should actually fix these issues rather than just skipping the data.
  2. Right now this behavior is informed by a hardcoded CSV file. Eventually the SDMX working group plans to put these constraints into the "dataflow". Whenever that happens, we can revisit and change our code to use the dataflow instead of the CSV file.
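Point 1 above could be sketched like this (the constraint structure, series code, and function name are all invented for illustration; the real constraints currently live in the hardcoded CSV file mentioned in point 2):

```python
import pandas as pd

# Hypothetical content constraints: a series mapped to the only allowed
# values for certain dimensions (e.g. a series that must be female).
constraints = {"SH_STA_MORT": {"SEX": {"F"}}}

def check_content_constraints(df: pd.DataFrame) -> None:
    """Collect every violation, then abort the build instead of
    silently dropping the offending rows."""
    problems = []
    for series, rules in constraints.items():
        rows = df[df["SERIES"] == series]
        for column, allowed in rules.items():
            bad = rows[~rows[column].isin(allowed)]
            if not bad.empty:
                problems.append(
                    f"{series}: invalid {column} values {sorted(bad[column])}"
                )
    if problems:
        raise ValueError("Build aborted; fix these rows:\n" + "\n".join(problems))

df = pd.DataFrame({"SERIES": ["SH_STA_MORT"], "SEX": ["M"], "Value": [5.0]})
try:
    check_content_constraints(df)
except ValueError as err:
    print("caught:", err)
```

Reporting all violations at once, rather than failing on the first, gives countries a complete list of rows to fix in a single build attempt.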

Thoughts?

@LucyGwilliamAdmin
Contributor

@brockfanning thanks, that makes sense

  1. Yeah, I think it does make sense that rows aren't silently dropped, but I'm also wondering what should happen if a country wants to disseminate additional information that doesn't comply with the DSD? I know that's what we'll want to do in the UK.
  2. I haven't learnt much about SDMX dataflows, but I think that would make sense, as it means we wouldn't have to maintain a CSV.
