catalog vocabulary slightly incompatible with example analysis script usage #120

Open
ceblanton opened this issue May 2, 2024 · 7 comments

Comments

@ceblanton
Collaborator

ceblanton commented May 2, 2024

FRE Canopy is generating catalogs using:

module load fre/canopy

fre catalog build --overwrite -i $ppdir -o $ppdir/catalog

sed -i.bak -e 's/,P1M,/,monthly,/' $ppdir/catalog.csv

An example pp directory and catalog file are here:

  • /archive/Chris.Blanton/am5/am5f7b11r0/c96L65_am5f7b11r0_amip/gfdl.ncrc5-deploy-prod-openmp/pp
  • /archive/Chris.Blanton/am5/am5f7b11r0/c96L65_am5f7b11r0_amip/gfdl.ncrc5-deploy-prod-openmp/pp/catalog.json

The example analysis script usage (the Ray example) is:

module load python/3.9

source /net2/rlm/analysis-scripts/example/env/bin/activate

python3 -c "from freanalysis_clouds import CloudAnalysisScript; CloudAnalysisScript().run_analysis('/archive/Chris.Blanton/am5/am5f7b11r0/c96L65_am5f7b11r0_amip/gfdl.ncrc5-deploy-prod-openmp/pp/catalog.json', '/nbhome/$USER/sample-output')"

That fails with this message:

/net2/rlm/analysis-scripts/example/env/lib/python3.9/site-packages/pydantic/deprecated/decorator.py:222: UserWarning: There are no datasets to load! Returning an empty dictionary.

  return self.raw_function(**d, **var_kwargs)

Traceback (most recent call last):

  File "<string>", line 1, in <module>

  File "/net2/rlm/analysis-scripts/example/env/lib/python3.9/site-packages/freanalysis_clouds/__init__.py", line 125, in run_analysis

    datasets[self.metadata.catalog_key(variable)],

KeyError: 'c96L65_am5f4b4r1-newrad_amip.monthly.na.atmos.high_cld_amt'

The mystery is that this very-similar catalog works:

/net2/rlm/analysis-scripts/example/catalog.json

We think the difference is "n/a" versus a missing value for the ensemble vocabulary.

Hopefully, the "fre catalog validate /path/to/schema.json /path/to/catalog-to-test.json" usage can detect this mismatch or inconsistency before we try to launch the script.
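A pre-flight check along these lines could be sketched as follows. This is only an illustration and not the `fre catalog validate` implementation: the helper name `find_empty_columns` and the sample rows are hypothetical, and it assumes the catalog CSV has a header row with column names such as `member_id`.

```python
import csv
import io


def find_empty_columns(csv_text, columns):
    """Return {column: number of rows whose value is empty} for the given columns."""
    counts = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for col in columns:
            # A column that is absent from the header is treated as empty too.
            value = (row.get(col) or "").strip()
            if not value:
                counts[col] = counts.get(col, 0) + 1
    return counts


# Hypothetical two-row catalog excerpt; the second row has no member_id.
sample = (
    "experiment_id,frequency,member_id,realm,variable_id\n"
    "exp_a,monthly,na,atmos,high_cld_amt\n"
    "exp_a,monthly,,atmos,high_cld_amt\n"
)

empty = find_empty_columns(sample, ["frequency", "member_id"])
print(empty)
```

Running a check like this before launching the analysis script would flag the empty column up front instead of surfacing it later as a KeyError.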

@aradhakrishnanGFDL
Owner

cat = cat.search(variable_id="high_cld_amt")
dset_dict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': False})

--> The keys in the returned dictionary of datasets are constructed as follows:
'source_id.experiment_id.frequency.modeling_realm.variable_id.chunk_freq'

████████████████████████████████████████████████████████████████████████████████████████| 100.00% [2/2 00:04<00:00]
dset_dict.keys()
dict_keys(['am5.c96L65_am5f7b11r0_amip.P1M.atmos_level.high_cld_amt.P1Y', 'am5.c96L65_am5f7b11r0_amip.P1M.atmos.high_cld_amt.P1Y'])
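To illustrate why the script's lookup misses, here is a minimal sketch of the key construction described above: the listed column values are joined with ".". The helper `build_key` is hypothetical (it is not the intake-esm internals), and the script-side pattern is only inferred from the shape of the KeyError reported earlier in this issue.

```python
def build_key(row, attrs, sep="."):
    """Join the values of the given columns into a dataset key."""
    return sep.join(str(row[attr]) for attr in attrs)


# Column order taken from the key pattern quoted in this comment.
catalog_attrs = [
    "source_id", "experiment_id", "frequency",
    "modeling_realm", "variable_id", "chunk_freq",
]

# Assumed script-side pattern, inferred from the KeyError text only.
script_attrs = [
    "experiment_id", "frequency", "member_id",
    "modeling_realm", "variable_id",
]

row = {
    "source_id": "am5",
    "experiment_id": "c96L65_am5f7b11r0_amip",
    "frequency": "P1M",
    "member_id": "na",
    "modeling_realm": "atmos",
    "variable_id": "high_cld_amt",
    "chunk_freq": "P1Y",
}

catalog_key = build_key(row, catalog_attrs)
script_key = build_key(row, script_attrs)

datasets = {catalog_key: "dataset placeholder"}
print(catalog_key)
print(script_key in datasets)  # the two patterns differ, so the lookup misses
```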

@aradhakrishnanGFDL
Owner

@ceblanton member_id is empty (""). When it is empty, perhaps the logic in Ray's script should remove it from the key name?

@aradhakrishnanGFDL
Owner

Or we enforce no nulls, which may be something we discussed before.

@aradhakrishnanGFDL
Owner

On May 9th, it was decided to use "na" as the default value for the aggregate columns rather than empty values, to help maintain a "key pattern" at this early stage of adoption. Down the line, we will provide examples that dynamically query for the dataset/key names.
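As a rough sketch of that default (not the code in the catalog builder itself), empty values in the aggregate columns could be replaced with "na" like this. The helper name `fill_defaults` and the sample rows are hypothetical.

```python
def fill_defaults(rows, columns, default="na"):
    """Return copies of the rows with empty values in `columns` set to `default`."""
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for col in columns:
            if not str(new_row.get(col) or "").strip():
                new_row[col] = default
        filled.append(new_row)
    return filled


# Hypothetical catalog rows: one with an empty member_id, one already set.
rows = [
    {"experiment_id": "exp_a", "member_id": "", "variable_id": "high_cld_amt"},
    {"experiment_id": "exp_a", "member_id": "r1", "variable_id": "high_cld_amt"},
]

filled = fill_defaults(rows, ["member_id"])
print([r["member_id"] for r in filled])
```

With every aggregate column populated, each key keeps the same number of dot-separated fields, which is what keeps the "key pattern" stable.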

@aradhakrishnanGFDL
Owner

@ceblanton

The PR is ready for member_id to be "na" by default. However, I realize Ray's key is still missing the chunk frequency, which is an aggregate column. I am not sure whether leaving it in the key or using a default for chunk_freq is a good idea. We can't possibly find unique datasets without it. But this also circles back to not having to hard-code these key names.

This now works:

am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y

You can test:


import intake
import intake_esm  # makes intake.open_esm_datastore available

# catalog rebuilt with member_id defaulting to "na"
cat = "/home/a1r/cat/canopy/am5f7b11r0/c96L65_am5f7b11r0_amipn0513.json"

cat_store = intake.open_esm_datastore(cat)

cat_subset = cat_store.search(variable_id="high_cld_amt")

dset_dict = cat_subset.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': False})

#this gives the dataset names dynamically based on the search and existing catalog+spec. 

for k in dset_dict.keys(): 
    print(k)

#test for the new key that is expected to work

dset_dict['am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y']
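To avoid hard-coding the full key, a script could select keys by their components instead. The following is a minimal sketch under stated assumptions: the field order in `KEY_FIELDS` is inferred from the working key above, no field value contains a ".", the helpers `parse_key` and `find_keys` are hypothetical, and the second key in the list is a made-up companion entry for illustration.

```python
KEY_FIELDS = [
    "source_id", "experiment_id", "frequency", "member_id",
    "modeling_realm", "variable_id", "chunk_freq",
]


def parse_key(key, fields=KEY_FIELDS, sep="."):
    """Split a dataset key into a {field: value} dict."""
    parts = key.split(sep)
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(parts)}: {key}")
    return dict(zip(fields, parts))


def find_keys(keys, **wanted):
    """Return the keys whose parsed fields match every requested value."""
    matches = []
    for key in keys:
        parsed = parse_key(key)
        if all(parsed.get(field) == value for field, value in wanted.items()):
            matches.append(key)
    return matches


keys = [
    "am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y",
    "am5.c96L65_am5f7b11r0_amip.P1M.na.atmos.high_cld_amt.P1Y",  # hypothetical
]

matches = find_keys(keys, modeling_realm="atmos", variable_id="high_cld_amt")
print(matches)
```

In practice `keys` would come from `dset_dict.keys()`, so the script only needs to know the realm and variable it wants, not the experiment name or chunk frequency.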

@aradhakrishnanGFDL changed the title from "catalog vocabulary slightly incompatible with example analysis script uage" to "catalog vocabulary slightly incompatible with example analysis script usage" on May 14, 2024
@aradhakrishnanGFDL
Owner

Figure generated: /nbhome/a1r/analysis-scripts/pngs/cloud-fraction.png

Script used: https://github.com/aradhakrishnanGFDL/analysis-scripts/blob/prototype1-a1r/raytest.py

The changes are in my fork, and they are only for one suite:

https://github.com/aradhakrishnanGFDL/analysis-scripts/tree/prototype1-a1r/freanalysis_clouds

@aradhakrishnanGFDL
Owner

To support this, we need to remove source_id from the aggregation columns. MDTF uses it, though, so let's discuss. @ceblanton
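For discussion, here is a minimal sketch of what that change might look like on the catalog spec. It assumes that "aggregation columns" corresponds to the `groupby_attrs` list under `aggregation_control` in the catalog JSON, which may not match the actual builder; the helper `drop_groupby_attr` and the sample spec are hypothetical.

```python
import copy


def drop_groupby_attr(spec, attr):
    """Return a copy of the spec with `attr` removed from groupby_attrs."""
    new_spec = copy.deepcopy(spec)
    control = new_spec.get("aggregation_control")
    if control is not None and "groupby_attrs" in control:
        control["groupby_attrs"] = [
            a for a in control["groupby_attrs"] if a != attr
        ]
    return new_spec


# Hypothetical excerpt of a catalog spec (assumed layout).
spec = {
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": [
            "source_id", "experiment_id", "frequency", "member_id",
            "modeling_realm", "variable_id", "chunk_freq",
        ],
    }
}

updated = drop_groupby_attr(spec, "source_id")
print(updated["aggregation_control"]["groupby_attrs"])
```

Because dataset keys are built from these columns, dropping source_id would remove the leading "am5." from every key, which is why consumers such as MDTF that rely on it need to be considered first.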

2 participants