CCI Water Vapour Example
========================

The CCI water vapour input CSV file can be found in ``extensions/example_water_vapour/`` within this repository. This guide will take you through running the pipeline for this example set of 4 datasets.
Assuming you have already gone through the setup instructions in *Getting Started*, you can now proceed with creating a group for this test dataset.

A new *group* is created within the pipeline using the ``init`` operation as follows:

::

    python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::
    Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the ``-v`` *verbose* flag to indicate we want to see the ``[INFO]`` messages emitted by the pipeline. Adding a second ``v`` (``-vv``) would also show ``[DEBUG]`` messages.
    Also, the ``init`` phase is always run as a serial process, since it just involves creating the directories and config files required by the pipeline.
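
For example, to see the debug output as well, the same command could be run with the doubled verbose flag (illustrative only):

.. code-block:: console

    python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -vv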

The output of the above command should look something like this:

.. code-block:: console

    INFO [main-group]: Running init steps as serial process
    INFO [init]: Starting initialisation
    INFO [init]: Copying input file from relative path - resolved to /home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
    INFO [init]: Creating project directories
    INFO [init]: Creating directories/filelists for 1/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 2/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 3/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 4/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Created 24 files, 8 directories in group my_new_group
    INFO [init]: Written as group ID: my_new_group

Ok great, we've initialised the pipeline for our new group! Here's a summary diagram of the directories and files that were just created:

::

    WORKDIR
      - groups
        - my_new_group
          - proj_codes
            - main.txt
          - blacklist_codes.txt
          - datasets.csv # (a copy of the input file)

      - in_progress
        - my_new_group
          - code_1 # (codes 1 to 4 in this example)
            - allfiles.txt
            - base-cfg.json
            - phase_logs
              - scan.log
              - compute.log
              - validate.log
            - status_log.csv
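
As a quick sanity check, you can inspect some of the files just created. This is a sketch, assuming your ``WORKDIR`` environment variable is set as described in *Getting Started*:

.. code-block:: console

    # List the project codes registered for the group
    cat $WORKDIR/groups/my_new_group/proj_codes/main.txt
    # Inspect the per-dataset working directories
    ls $WORKDIR/in_progress/my_new_group/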

For peace of mind, and to check your understanding of the pipeline assessor tool, we suggest running this command next:

::

    python assess.py progress my_new_group

Your output should then look something like this:

.. code-block:: console

    Group: my_new_group
    Total Codes: 4

    Pipeline Current:
    init : 4 [100.%] (Variety: 1)
        - complete : 4

    Pipeline Complete:
    complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully, and no datasets have completed the pipeline yet.

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets, which would complete the pipeline.

.. note::
    For each of the above phases, jobs will be submitted to SLURM when using the ``group_run`` script. Please make sure to wait until all jobs are complete for one phase *before* running the next one.
    After each job, check the progress of the pipeline with the same command as before to check that all the datasets ``complete`` as expected. See below on what to do if datasets encounter errors.

.. code-block:: console

    python group_run.py scan my_new_group
    python group_run.py compute my_new_group
    python group_run.py validate my_new_group
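
To confirm that all SLURM jobs for the current phase have finished before launching the next one, you can query the queue. This is a sketch assuming a standard SLURM setup:

.. code-block:: console

    # Show your currently queued and running jobs; an empty list means the phase is done
    squeue -u $USER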

A more complex example of the errors you might encounter while running the pipeline is shown below:

.. code-block:: console

    Group: cci_group_v1
    Total Codes: 361

    Pipeline Current:
    compute : 21 [5.8 %] (Variety: 2)
        - complete : 20
        - KeyError 'refs' : 1

    Pipeline Complete:
    complete : 185 [51.2%]
    blacklist : 155 [42.9%] (Variety: 8)
        - NonKerchunkable : 50
        - PartialDriver : 3
        - PartialDriverFail : 5
        - ExhaustedMemoryLimit : 56
        - ExhaustedTimeLimit : 18
        - ExhaustedTimeLimit* : 1
        - ValidationMemoryLimit : 21
        - ScipyDimIssue : 1

In this example, 185 of the datasets in the ``cci_group_v1`` group have completed the pipeline, while 155 have been excluded (see Blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset, we can use the command below:

.. code-block:: console

    python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E

This will match our ``compute``-phase error with that message, and the ``-E`` flag will give us the whole error log from that run. This may be enough to assess and fix the issue; otherwise, a command to rerun just this dataset will be suggested by the assessor:

.. code-block:: console

    Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
    Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d

This rerun command includes several flags; the most important here is the ``-G`` group flag, since we now need to use the ``single_run`` script and so must specify the group. The ``-d`` dryrun flag means we will not produce any output files, since we may need to test and rerun several times.
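
Once the issue is fixed, the same command can be run without the ``-d`` flag so that output files are actually produced (a sketch based on the suggested command above):

.. code-block:: console

    python single_run.py compute 218 -G cci_group_v1 -vv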

Getting Started
===============

.. note::
    Ensure you have local modules enabled such that you have Python 3.x installed in your local environment. A version of the pipeline source code exists at ``/gws/nopw/j04/cedaproc/kerchunk_builder``, so please see if this can be used before cloning the repository elsewhere.

Step 0: Git clone the repository
--------------------------------

If you need to clone the repository, either clone the main branch (no branch specified) or check for the latest version at github.com/cedadev/kerchunk-builder, which you can clone using:

::

    git clone git@github.com:cedadev/kerchunk-builder.git --branch v1.0.1

Step 1: Set up Virtual Environment
----------------------------------

Step 1 is to create a virtual environment and install the necessary packages with pip. This can be done inside the local repo you've cloned, as ``local`` or ``kvenv`` (both of which are ignored by the repository), or you can create a venv elsewhere in your home directory, e.g. ``~/venvs/build_venv``. If you are using the pipeline version in ``cedaproc``, there should already be a virtual environment set up.
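
The exact commands are environment-specific, but a minimal sketch of this step (assuming a standard venv workflow and a ``requirements.txt`` in the repo root) looks like:

.. code-block:: console

    python -m venv kvenv
    source kvenv/bin/activate
    pip install -r requirements.txt   # assumed dependency file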

Step 2: Environment configuration
---------------------------------

Create a config file to set the necessary environment variables. (We suggest placing this in the local ``config/`` folder, as this is ignored by git.) E.g.:

.. code-block:: console

    export WORKDIR=/path/to/kerchunk-pipeline
    export SRCDIR=/gws/nopw/j04/cedaproc/kerchunk_builder/kerchunk-builder
    export KVENV=$SRCDIR/kvenv
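
Assuming you saved these exports to, say, ``config/setup-env.sh`` (a hypothetical name), you can load them into your shell with:

.. code-block:: console

    source config/setup-env.sh
    echo $WORKDIR   # confirm the variables are set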

Now you should be set up to run the pipeline properly. For any of the pipeline scripts, running ``python <script>.py -h`` (or ``--help``) will bring up a list of options to use for that script, as well as the required parameters.

It is also helpful to create a setup/config bash script to set all your environment variables, which include:

- WORKDIR: The working directory for the pipeline (where to store all the cache files)
- SRCDIR: Path to the kerchunk-builder repo where it has been cloned.
- KVENV: Path to a virtual environment for the pipeline.

Some useful options/flags to add:

::

    -Q # Quality
       # - thorough run - use to ignore cache files and perform checks on all netcdf files
    -r # repeat_id
       # - default uses main; if you have created repeat_ids manually or with assess.py, specify here.
    -d # dryrun
       # - Skip creating any new files in this phase