Version 1.1 Major changes to Error Logging #12

Merged: 49 commits, Mar 8, 2024
Changes from all commits (49 commits)
51613b2
Updated inspection method
dwest77a Feb 20, 2024
cec7630
Aded backtrack and xkshape warning
dwest77a Feb 20, 2024
ab5772d
Added backtrack
dwest77a Feb 20, 2024
934f093
Added XKShape tolerance and get_concat_dims for multiple validation o…
dwest77a Feb 20, 2024
df6fddb
Updated version
dwest77a Feb 20, 2024
d6edc55
Updated setup and added cci env variables
dwest77a Feb 20, 2024
36cad7e
Changed variable difference message between files
dwest77a Feb 20, 2024
1d7f8ac
Merge branch 'dev' into dev_update
dwest77a Feb 20, 2024
91f2ccb
Merge pull request #10 from cedadev/dev_update
dwest77a Feb 20, 2024
b8de738
Integration of blacklist into progress tracker
dwest77a Feb 21, 2024
c1a6ae6
Added dryrun carrythrough
dwest77a Feb 21, 2024
5b18988
Overhauled dimension handling and added extra functions, scanning now…
dwest77a Feb 21, 2024
1688b01
Added functionality for dimension highlighting and concatenation test…
dwest77a Feb 21, 2024
66ef1d4
Added multiple nan-type checks
dwest77a Feb 21, 2024
22a74d7
Added concat fatal error for l2 data
dwest77a Feb 21, 2024
aa96b57
Integration with other pipeline functions
dwest77a Feb 21, 2024
1abd97c
Added blacklist/virtual support, reading log and status files
dwest77a Mar 6, 2024
b964318
Minor changes experimenting with streamline/dependency - abandoned
dwest77a Mar 6, 2024
d8790b2
Removed error and output log options per job
dwest77a Mar 6, 2024
6334ec4
Set up a dedicated place for config files
dwest77a Mar 6, 2024
f5ea0df
Organised multi-purpose functions to a single script within the pipeline
dwest77a Mar 6, 2024
84139a8
Initialised allocator script
dwest77a Mar 6, 2024
7fc1387
Moved scripts
dwest77a Mar 6, 2024
c4bbc23
Added requirements that were misplaced
dwest77a Mar 6, 2024
512b26a
Added new features; loading and creating refs as required, saving ref…
dwest77a Mar 6, 2024
1d66435
Added file handling and status logging
dwest77a Mar 6, 2024
2f3b154
Updated import locations
dwest77a Mar 6, 2024
89e7f37
Removed scanfile skip option
dwest77a Mar 6, 2024
ab9890d
Added file handlers and status logging, also pass/fail counters
dwest77a Mar 6, 2024
6355ed8
Added test/example notebooks
dwest77a Mar 6, 2024
32f7c3f
Updated with comments and rearranged sections
dwest77a Mar 6, 2024
7b71b88
Merge pull request #11 from cedadev/cci_cases
dwest77a Mar 6, 2024
834e867
Removed old examples
dwest77a Mar 8, 2024
b004939
Removed unnecessary or outdated files
dwest77a Mar 8, 2024
2a9669b
Updated for move to cedaproc
dwest77a Mar 8, 2024
905291e
Added template for cedaproc use
dwest77a Mar 8, 2024
775a640
Removed testing notebooks and updated showcase tools
dwest77a Mar 8, 2024
5aefe8e
Updated assessor documentation
dwest77a Mar 8, 2024
42335f2
updated gitignore
dwest77a Mar 8, 2024
2721234
Added example input file for tutorial and startup:
dwest77a Mar 8, 2024
8d4fc67
Added introductory powerpoints
dwest77a Mar 8, 2024
e4e7dac
Updated docs with example
dwest77a Mar 8, 2024
ff9d068
Added template setup file
dwest77a Mar 8, 2024
43181e1
Fixed syntax issue
dwest77a Mar 8, 2024
8e9f3ec
Renamed file
dwest77a Mar 8, 2024
85ea009
Removed deploy function for dependent jobs
dwest77a Mar 8, 2024
363d953
Updated utils, added filehandler continuation support for all pipelin…
dwest77a Mar 8, 2024
c03aeb3
Added job log system support and traceback within error logs
dwest77a Mar 8, 2024
affa3e4
Merge branch 'main' into dev
dwest77a Mar 8, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,8 +1,10 @@
.ipynb_checkpoints
build_venv/
kvenv/
temp/
testing/
*__pycache__*
.vscode/
docs/build/
build/
pipeline.egg-info
810 changes: 488 additions & 322 deletions assess.py

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions config/setup-cci.sh
@@ -0,0 +1,7 @@
module load jaspy

export WORKDIR=/gws/nopw/j04/esacci_portal/kerchunk_conversion/
export SRCDIR=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
export KVENV=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder/build_venv

source $KVENV/bin/activate
@@ -1,4 +1,8 @@
module load jaspy

export WORKDIR=/gws/nopw/j04/cmip6_prep_vol1/kerchunk-pipeline
export GROUPDIR=/gws/nopw/j04/cmip6_prep_vol1/kerchunk-pipeline/groups/CMIP6_rel1_6233
export SRCDIR=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
export KVENV=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder/build_venv

source $KVENV/bin/activate
File renamed without changes.
7 changes: 7 additions & 0 deletions config/setup-project.sh.template
@@ -0,0 +1,7 @@
module load jaspy

export WORKDIR=/path/to/kerchunk-pipeline
export SRCDIR=/gws/nopw/j04/cedaproc/kerchunk_builder/kerchunk-builder
export KVENV=$SRCDIR/kvenv

source $KVENV/bin/activate
66 changes: 55 additions & 11 deletions docs/source/assess-overview.rst
@@ -4,45 +4,89 @@ Assessor Tool
The assessor script ``assess.py`` is an all-purpose pipeline checking tool which can be used to assess:
- The current status of all datasets within a given group in the pipeline (which phase each dataset currently sits in)
- The errors/outputs associated with previous job runs.
- Specific logs from datasets which are presenting a specific type of error.

An example command to run the assessor tool can be found below:
::

    python assess.py <operation> <group>

Where the operation can be one of the options below:
- ``progress``: Get a general overview of the pipeline; how many datasets have completed or are stuck on each phase.
- ``summarise``: Get an assessment of the data processed for this group.
- ``display``: Display a specific type of information about the pipeline (blacklisted codes, datasets with virtual dimensions or using parquet).
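
For illustration, the command surface above can be sketched with argparse. This is a hypothetical sketch: the flag names are taken from the examples in this document, and the real ``assess.py`` defines many more options.

```python
import argparse

def build_parser():
    # Sketch of the assessor CLI: a positional operation and group,
    # plus a few of the flags used in the examples in this document.
    p = argparse.ArgumentParser(prog="assess.py")
    p.add_argument("operation", choices=["progress", "summarise", "display"])
    p.add_argument("group")
    p.add_argument("-p", "--phase", help="restrict output to one pipeline phase")
    p.add_argument("-r", "--repeat-id", help="label of a previous repeat run")
    p.add_argument("-e", "--error", help="filter datasets by error message")
    p.add_argument("-W", "--write", action="store_true",
                   help="actually write output files instead of dry-running")
    return p

args = build_parser().parse_args(["progress", "cci_group_v1", "-p", "scan"])
print(args.operation, args.group, args.phase)  # progress cci_group_v1 scan
```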

1. Overall Progress of the Pipeline
-----------------------------------

To see the general status of the pipeline for a given group:
::

python assess.py <group> progress
python assess.py progress <group>

An example output from this command can be seen below:
::

Group: cci_group_v1
Total Codes: 361

scan : 1 [0.3 %] (Variety: 1)
- Complete : 1

complete : 185 [51.2%] (Variety: 1)
- complete : 185

unknown : 21 [5.8 %] (Variety: 1)
- no data : 21

blacklist : 162 [44.9%] (Variety: 7)
- NonKerchunkable : 50
- PartialDriver : 3
- PartialDriverFail : 5
- ExhaustedMemoryLimit : 64
- ExhaustedTimeLimit : 18
- ExhaustedTimeLimit* : 1
- ValidationMemoryLimit : 21

In this case, 185 datasets have completed the pipeline, with 1 left to be scanned. The 21 unknowns have no log file, so there is no information on these. This will be resolved in later versions, where a ``seek`` function will run automatically when checking progress to fix gaps in the logs for missing datasets.
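
The percentage breakdown shown above can be derived from per-dataset statuses with a simple counter. A minimal sketch using hypothetical data (not the pipeline's actual implementation):

```python
from collections import Counter

# Hypothetical per-dataset statuses mirroring the summary format above.
statuses = {
    "ds001": "complete",
    "ds002": "complete",
    "ds003": "scan",
    "ds004": "unknown",
    "ds005": "blacklist",
}

counts = Counter(statuses.values())
total = len(statuses)
for phase, n in counts.most_common():
    # e.g. "complete  : 2    [40.0%]"
    print(f"{phase:<10}: {n:<4} [{100 * n / total:.1f}%]")
```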


An example use case is to write out all datasets that require scanning to a new label (repeat_label):
::

python assess.py <group> progress -p scan -r <label_for_scan_subgroup> -W
python assess.py progress <group> -p scan -r <label_for_scan_subgroup> -W


The last flag ``-W`` is required when writing an output file from this program; otherwise the program will dry-run and produce no files.

2. Checking errors
------------------
Check what repeat labels are available already using
Check what repeat labels are available already using:
::

python assess.py <group> errors -s labels

python assess.py display <group> -s labels

Show what jobs have previously run
To list the status of all datasets from a previous repeat ID:
::

python assess.py <group> errors -s jobids
python assess.py progress <group> -r <repeat_id>


For showing all errors from a previous job run
For selecting a specific type of error (-e) and examining the full log for each example (-E):
::

python assess.py <group> errors -j <jobid>
python assess.py progress <group> -r <old_id> -e "type_of_error" -p scan -E

Following from this, you may want to rerun the pipeline for just one type of error previously found:
::

python assess.py progress <group> -r <old_repeat_id> -e "type_of_error" -p scan -n <new_repeat_id>

Note that if you are looking at a specific repeat ID, you can forgo the phase (-p) flag, since this set is expected to appear in the same phase anyway.
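
Conceptually, the error filter described above selects the project codes whose logged error matches a given type, optionally restricted to a phase. A sketch with hypothetical records (not real pipeline output):

```python
# Each record: (project code, phase, last logged error).
records = [
    ("code_1", "scan", "KeyError 'refs'"),
    ("code_2", "scan", "ExhaustedMemoryLimit"),
    ("code_3", "compute", "KeyError 'refs'"),
]

def select(records, error, phase=None):
    # Keep codes whose error matches; phase=None means any phase,
    # mirroring how the -p flag can be omitted.
    return [code for code, ph, err in records
            if err == error and (phase is None or ph == phase)]

print(select(records, "KeyError 'refs'"))                # ['code_1', 'code_3']
print(select(records, "KeyError 'refs'", phase="scan"))  # ['code_1']
```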

3. Special Display options
--------------------------

For selecting a specific type of error to investigate (-i) and examine the full log for each example (-E)
Check how many of the datasets in a group have virtual dimensions
::

python assess.py test errors -j <jobid> -i "type_of_error" -E
python assess.py display <group> -s virtuals
139 changes: 139 additions & 0 deletions docs/source/cci_water.rst
@@ -0,0 +1,139 @@
CCI Water Vapour Example
========================

The CCI water vapour input CSV file can be found in ``extensions/example_water_vapour/`` within this repository. This guide will take you through running the pipeline for this example set of 4 datasets.
Assuming you have already gone through the setup instructions in *Getting Started*, you can now proceed with creating a group for this test dataset.

A new *group* is created within the pipeline using the ``init`` operation as follows:

::

python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::
Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the (-v) *verbose* flag to indicate we want to see the ``[INFO]`` messages put out by the pipeline. Adding a second (v) would also show ``[DEBUG]`` messages.
Also the ``init`` phase is always run as a serial process since it just involves creating the directories and config files required by the pipeline.

The output of the above command should look something like this:

.. code-block:: console

INFO [main-group]: Running init steps as serial process
INFO [init]: Starting initialisation
INFO [init]: Copying input file from relative path - resolved to /home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
INFO [init]: Creating project directories
INFO [init]: Creating directories/filelists for 1/4
INFO [init]: Updated new status: init - complete
INFO [init]: Creating directories/filelists for 2/4
INFO [init]: Updated new status: init - complete
INFO [init]: Creating directories/filelists for 3/4
INFO [init]: Updated new status: init - complete
INFO [init]: Creating directories/filelists for 4/4
INFO [init]: Updated new status: init - complete
INFO [init]: Created 24 files, 8 directories in group my_new_group
INFO [init]: Written as group ID: my_new_group

Ok great, we've initialised the pipeline for our new group! Here's a summary diagram of what directories and files were just created:

::

WORKDIR
- groups
- my_new_group
- proj_codes
- main.txt
- blacklist_codes.txt
- datasets.csv # (a copy of the input file)

- in_progress
- my_new_group
- code_1 # (codes 1 to 4 in this example)
- allfiles.txt
- base-cfg.json
- phase_logs
- scan.log
- compute.log
- validate.log
- status_log.csv
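
The directory creation shown in the diagram above can be approximated in a few lines of Python. This is a hypothetical sketch using only the names in the diagram; the real ``init`` phase also writes the config files and the CSV copy of the input.

```python
from pathlib import Path
import tempfile

def init_group(workdir: Path, group: str, codes: list[str]) -> None:
    # Group-level files.
    g = workdir / "groups" / group
    (g / "proj_codes").mkdir(parents=True, exist_ok=True)
    (g / "proj_codes" / "main.txt").write_text("\n".join(codes))
    (g / "blacklist_codes.txt").touch()
    # Per-dataset project directories.
    for code in codes:
        proj = workdir / "in_progress" / group / code
        (proj / "phase_logs").mkdir(parents=True, exist_ok=True)
        (proj / "allfiles.txt").touch()
        (proj / "status_log.csv").touch()

workdir = Path(tempfile.mkdtemp())
init_group(workdir, "my_new_group", ["code_1", "code_2"])
print(sorted(p.name for p in (workdir / "groups" / "my_new_group").iterdir()))
# → ['blacklist_codes.txt', 'proj_codes']
```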

For peace of mind, and to check you understand the pipeline assessor tool, we suggest running this command next:

::

python assess.py progress my_new_group

Upon which your output should look something like this:

.. code-block:: console

Group: my_new_group
Total Codes: 4

Pipeline Current:

init : 4 [100.%] (Variety: 1)
- complete : 4

Pipeline Complete:

complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully, no datasets are complete through the pipeline yet.

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.

.. note::
For each of the above phases, jobs will be submitted to SLURM when using the ``group_run`` script. Please make sure to wait until all jobs are complete for one phase *before* running the next job.
After each job, check the progress of the pipeline with the same command as before to check all the datasets ``complete`` as expected. See below on what to do if datasets encounter errors.

.. code-block:: console

python group_run.py scan my_new_group
python group_run.py compute my_new_group
python group_run.py validate my_new_group

A more complex example of the errors you might encounter while running the pipeline can be found below:

.. code-block:: console

Group: cci_group_v1
Total Codes: 361

Pipeline Current:

compute : 21 [5.8 %] (Variety: 2)
- complete : 20
- KeyError 'refs' : 1

Pipeline Complete:

complete : 185 [51.2%]

blacklist : 155 [42.9%] (Variety: 8)
- NonKerchunkable : 50
- PartialDriver : 3
- PartialDriverFail : 5
- ExhaustedMemoryLimit : 56
- ExhaustedTimeLimit : 18
- ExhaustedTimeLimit* : 1
- ValidationMemoryLimit : 21
- ScipyDimIssue : 1

In this example ``cci_group_v1`` group, 185 of the datasets have completed the pipeline, while 155 have been excluded (See blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 of them have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset we can use the command below:

.. code-block:: console

python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E

This matches our ``compute``-phase error with that message, and the (-E) flag gives us the whole error log from that run. This may be enough to assess and fix the issue; otherwise, the assessor will suggest a command to rerun just this dataset:

.. code-block:: console

Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d

This rerun command includes several flags; the most important here is the (-G) group flag, since the ``single_run`` script requires the group to be specified. The (-d) dryrun flag means no output files are produced, which is useful when you may need to test and rerun several times.
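
The dryrun behaviour described above follows a common guard pattern; a sketch, not the pipeline's actual implementation:

```python
def save_refs(path: str, data: bytes, dryrun: bool = False) -> bool:
    # With dryrun set, report what would happen and write nothing,
    # returning False to indicate no file was produced.
    if dryrun:
        print(f"[DRYRUN] would write {len(data)} bytes to {path}")
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True

written = save_refs("refs.json", b"{}", dryrun=True)
print(written)  # False
```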



67 changes: 0 additions & 67 deletions docs/source/examples.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -19,7 +19,7 @@ The pipeline consists of four central phases, with an additional phase for ingestion

Introduction <pipeline-overview>
Getting Started <start>
Worked Examples <examples>
Example CCI Water Vapour <cci_water>
Pipeline Flags/Options <execution>
Assessor Tool Overview <assess-overview>
Error Codes <errors>