Design for improving the tool so new data sources can be added easily
The plan for what the configuration should look like is in #2371. Example dkrz-intake.yml configuration file:
projects:
  CMIP6:
    data:
      CMIP6-intake-esm:
        type: esmvalcore.intake.IntakeDataSource
        file: '/pool/data/Catalogs/levante-cmip6.json'
        facets:
          # mapping from recipe facets to intake-esm catalog facets
          activity: activity_id
          dataset: source_id
          ensemble: member_id
          exp: experiment_id
          grid: grid_label
          institute: institution_id
          mip: table_id
          short_name: variable_id
          version: version
We plan to ship these configuration files with ESMValCore for supported data sources, so users won't have to write them themselves. The configuration gives the name of a class (e.g. esmvalcore.local.DataSource) and any arguments needed to construct an instance of it, which can then be used to find 'data'. This class should have a method to find the data, e.g. named find_files or find_data, that takes the facets from the recipe (plus any automatically added facets) as arguments and returns an object or a list of objects that can be used to access the data (e.g. esmvalcore.local.LocalFile or esmvalcore.esgf.ESGFFile). Maybe these could also be Iris Cubes or Xarray Datasets; I'm not sure if intermediate objects are needed in cases where constructing an Iris Cube or Xarray Dataset is really fast. The found 'data objects' may then be passed on to the esmvalcore.preprocessor.load function or, alternatively, be inserted somewhere in the esmvalcore.dataset.Dataset.load method (skipping the current load/fix/concatenate functions, although CMOR checking will need to be done regardless), and enter the preprocessing chain from there.
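To make the idea above more concrete, here is a minimal sketch of what such a pluggable interface could look like. Everything in it is an assumption for illustration: the DataSource protocol, the InMemoryDataSource toy class, and the build_data_source helper are hypothetical names, not existing ESMValCore API. The sketch only shows the two mechanisms described above: constructing an instance from a dotted class name plus arguments, and a find_data method that translates recipe facets into the data source's own facets.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Protocol


class DataSource(Protocol):
    """Hypothetical interface; not the actual ESMValCore API."""

    def find_data(self, **facets: Any) -> list[Any]:
        """Return objects that give access to the data matching the facets."""
        ...


@dataclass
class InMemoryDataSource:
    """Toy data source, used only to illustrate the interface."""

    # Mapping from recipe facet names to the names used by this source.
    facets: dict[str, str]
    # Plain dicts standing in for catalog entries.
    records: list[dict[str, str]] = field(default_factory=list)

    def find_data(self, **facets: Any) -> list[dict[str, str]]:
        # Translate recipe facets to source facets; unmapped facets are ignored.
        query = {self.facets[k]: v for k, v in facets.items() if k in self.facets}
        return [
            record
            for record in self.records
            if all(record.get(key) == value for key, value in query.items())
        ]


def build_data_source(config: dict[str, Any]) -> Any:
    """Instantiate the class named by 'type', passing the other keys as arguments."""
    kwargs = dict(config)
    module_name, _, class_name = kwargs.pop("type").rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)


source = InMemoryDataSource(
    facets={"dataset": "source_id", "short_name": "variable_id"},
    records=[
        {"source_id": "MPI-ESM1-2-HR", "variable_id": "tas"},
        {"source_id": "MPI-ESM1-2-HR", "variable_id": "pr"},
    ],
)
found = source.find_data(dataset="MPI-ESM1-2-HR", short_name="tas")
```

With a design along these lines, supporting a new data source would mostly mean writing one class with a find_data method and shipping a configuration entry whose type key points at it.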
Some additional things to consider:
For some data sources we need the ability to deduplicate input data across multiple data sources, in particular for the CMIP data (use case: most data available in a centrally managed directory and other data in a user managed directory). This could e.g. be done by adding a name and version attribute (currently this is done based on filename and 'version' facet here) and having a generic function that is applied to all input data objects and filters them so that there is only one data object for each name, and it is the requested (or latest) version. Individual data sources must also have the ability to deduplicate, e.g. here. Side note: the current implementation does not work if newer versions of files use different filenames because the time slices stored in the files differ from the old version. This issue is probably hidden by the concatenate preprocessor function, which takes out duplicated data, but not necessarily the correct bits.
Fixes are often specific to the data source, but there can also be overlap. Therefore they should probably be applied as part of the data object load instead of in the generic esmvalcore.dataset.Dataset.load function.
In the future we would like to make fixes standalone Python packages based on Xarray (perhaps in combination with ncdata) so they will have larger community uptake and contributions. It seems likely that there will be one fixes package per project (e.g. CMIP7, CMIP6, CORDEX-CMIP6) and data source (e.g. NetCDF files, Xarray datasets, Zarr).
We would like to add support for intake-esm, intake-esgf, xcube, and possibly more (e.g. intake-STAC or PySTAC), so the design should be easy to extend with additional data sources.
[...] esmvalcore.local and esmvalcore.esgf modules by intake-esgf, depending on how that develops.
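The generic deduplication function mentioned in the list above could look roughly like the sketch below. It is only an illustration under stated assumptions: DataObject and deduplicate are hypothetical names, every data object is assumed to expose a name and a version attribute, and version strings such as 'v20200101' are assumed to sort chronologically when compared as text. It shares the limitation from the side note: two versions of the same data stored under different names are not recognised as duplicates.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DataObject:
    """Hypothetical stand-in for e.g. a local file or an ESGF file."""

    name: str
    version: str
    source: str


def deduplicate(
    objects: Iterable[DataObject],
    requested_version: Optional[str] = None,
) -> list[DataObject]:
    """Keep one object per name: the requested version if given, else the latest."""
    selected: dict[str, DataObject] = {}
    for obj in objects:
        if requested_version is not None and obj.version != requested_version:
            continue
        current = selected.get(obj.name)
        # Assumption: version strings like 'v20200101' sort chronologically.
        # On a tie the first object wins, so earlier data sources take priority.
        if current is None or obj.version > current.version:
            selected[obj.name] = obj
    return list(selected.values())


objects = [
    DataObject("tas.nc", "v20190101", "central"),
    DataObject("tas.nc", "v20200101", "user"),
    DataObject("pr.nc", "v20190101", "central"),
    DataObject("pr.nc", "v20190101", "user"),
]
latest = deduplicate(objects)
pinned = deduplicate(objects, requested_version="v20190101")
```

Because ties go to the first object seen, the order in which data sources are searched doubles as a priority order, which would cover the use case of a centrally managed directory combined with a user managed one.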