Updated docs with example
dwest77a committed Mar 8, 2024
1 parent 8d4fc67 commit e4e7dac
Showing 3 changed files with 147 additions and 11 deletions.
138 changes: 138 additions & 0 deletions docs/source/cci_water.rst
@@ -0,0 +1,138 @@
CCI Water Vapour Example
========================

The CCI water vapour input CSV file can be found in ``extensions/example_water_vapour/`` within this repository. This guide will take you through running the pipeline for this example set of 4 datasets.
Assuming you have already gone through the setup instructions in *Getting Started*, you can now proceed with creating a group for this test dataset.

A new *group* is created within the pipeline using the ``init`` operation as follows:

::

    python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::
    Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the (-v) *verbose* flag to indicate we want to see the ``[INFO]`` messages put out by the pipeline. Adding a second (v) would also show ``[DEBUG]`` messages.
    Also, the ``init`` phase is always run as a serial process, since it just involves creating the directories and config files required by the pipeline.

The output of the above command should look something like this:

.. code-block:: console

    INFO [main-group]: Running init steps as serial process
    INFO [init]: Starting initialisation
    INFO [init]: Copying input file from relative path - resolved to /home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
    INFO [init]: Creating project directories
    INFO [init]: Creating directories/filelists for 1/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 2/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 3/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 4/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Created 24 files, 8 directories in group my_new_group
    INFO [init]: Written as group ID: my_new_group

Ok great, we've initialised the pipeline for our new group! Here's a summary diagram of what directories and files were just created:

::

    WORKDIR
      - groups
        - my_new_group
          - proj_codes
            - main.txt
          - blacklist_codes.txt
          - datasets.csv # (a copy of the input file)

      - in_progress
        - my_new_group
          - code_1 # (codes 1 to 4 in this example)
            - allfiles.txt
            - base-cfg.json
            - phase_logs
              - scan.log
              - compute.log
              - validate.log
            - status_log.csv

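To see these for yourself, here is a quick sketch of inspecting the new group (it assumes ``WORKDIR`` is exported as described in *Getting Started*, and that ``main.txt`` lists one project code per line - both are assumptions based on the tree above):

```shell
# List the group-level files created by init
ls $WORKDIR/groups/my_new_group
# proj_codes/main.txt is assumed to hold the project codes, one per line
cat $WORKDIR/groups/my_new_group/proj_codes/main.txt
```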
For peace of mind, and to check your understanding of the pipeline assessor tool, we suggest running this command next:

::

    python assess.py progress my_new_group

Upon which your output should look something like this:

.. code-block:: console

    Group: my_new_group
    Total Codes: 4

    Pipeline Current:
    init : 4 [100.%] (Variety: 1)
        - complete : 4

    Pipeline Complete:
    complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully; no datasets have yet completed the pipeline.
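These summaries are built up from each dataset's ``status_log.csv``. As a purely illustrative sketch (the real column layout of the log is an assumption here), similar tallies can be produced with standard shell tools:

```shell
# Build a toy status_log.csv - the 'phase,status' layout is assumed for illustration
printf 'init,complete\nscan,complete\ncompute,KeyError\n' > status_log.csv
# Tally occurrences of each phase/status pair, assessor-style
awk -F, '{n[$1" - "$2]++} END {for (k in n) print k" : "n[k]}' status_log.csv | sort
```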

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.

.. note::
    For each of the above phases, jobs will be submitted to SLURM when using the ``group_run`` script. Please make sure to wait until all jobs are complete for one phase *before* running the next.
    After each job, check the progress of the pipeline with the same command as before, to check that all the datasets reach ``complete`` as expected. See below for what to do if datasets encounter errors.

.. code-block:: console

    python group_run.py scan my_new_group
    python group_run.py compute my_new_group
    python group_run.py validate my_new_group

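Each of these submits jobs to SLURM, so between phases you can watch the queue with the standard SLURM tooling (``squeue`` and ``watch`` are standard tools, not part of this pipeline; the refresh interval is just a suggestion):

```shell
# Show your queued and running jobs; rerun until the phase's jobs are gone
squeue -u $USER
# Or refresh automatically every 30 seconds
watch -n 30 squeue -u $USER
```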
A more complex example of the errors you might encounter while running the pipeline can be found below:

.. code-block:: console

    Group: cci_group_v1
    Total Codes: 361

    Pipeline Current:
    compute : 21 [5.8 %] (Variety: 2)
        - complete : 20
        - KeyError 'refs' : 1

    Pipeline Complete:
    complete : 185 [51.2%]
    blacklist : 155 [42.9%] (Variety: 8)
        - NonKerchunkable : 50
        - PartialDriver : 3
        - PartialDriverFail : 5
        - ExhaustedMemoryLimit : 56
        - ExhaustedTimeLimit : 18
        - ExhaustedTimeLimit* : 1
        - ValidationMemoryLimit : 21
        - ScipyDimIssue : 1

In this example ``cci_group_v1`` group, 185 of the datasets have completed the pipeline, while 155 have been excluded (See blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 of them have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset we can use the command below:

.. code-block:: console

    python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E

This will match our ``compute``-phase error with that message, and the (-E) flag will print the whole error log from that run. This may be enough to assess and fix the issue; otherwise, to rerun just this dataset, a rerun command will be suggested by the assessor:

.. code-block:: console

    Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
    Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d

This rerun command has several flags included; the most important here is the (-G) group flag, since we are now using the ``single_run`` script and so need to specify the group. The (-d) dryrun flag means we are not producing any output files, since we may need to test and rerun several times.
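Putting this together, one plausible rerun sequence looks like the following (a sketch only: the project-code index ``218`` is taken from the assessor output above, and rerunning without ``-d`` to produce output files is an assumption based on the flag's description):

```shell
# Dry-run the failing dataset first (-d: no output files are written)
python single_run.py compute 218 -G cci_group_v1 -vv -d
# When satisfied, rerun without -d so output files are produced
python single_run.py compute 218 -G cci_group_v1 -vv
# Then re-check overall group progress
python assess.py progress cci_group_v1
```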



2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -19,7 +19,7 @@ The pipeline consists of four central phases, with an additional phase for inges

Introduction <pipeline-overview>
Getting Started <start>
- Worked Examples <examples>
+ Example CCI Water Vapour <cci_water>
Pipeline Flags/Options <execution>
Assessor Tool Overview <assess-overview>
Error Codes <errors>
18 changes: 8 additions & 10 deletions docs/source/start.rst
@@ -3,19 +3,19 @@ Getting Started

.. note::

-   Ensure you have local modules enabled such that you have python 3.x installed in your local environment.
+   Ensure you have local modules enabled such that you have python 3.x installed in your local environment. A version of the pipeline source code exists at ``/gws/nopw/j04/cedaproc/kerchunk_builder`` so please see if this can be used before cloning the repository elsewhere.

Step 0: Git clone the repository
--------------------------------
- The Kerchunk builder will soon be updated to version 1.0.1, which you can clone using:
+ If you need to clone the repository, either clone the main branch (no branch specified) or check for the latest version at github.com/cedadev/kerchunk-builder, which you can clone using:
::

    git clone git@github.com:cedadev/kerchunk-builder.git --branch v1.0.1

Step 1: Set up Virtual Environment
----------------------------------

- Step 1 is to create a virtual environment and install the necessary packages with pip. This can be done inside the local repo you've cloned as ```local``` or ```build_venv``` which will be ignored by the repository, or you can create a venv elsewhere in your home directory i.e ```~/venvs/build_venv```
+ Step 1 is to create a virtual environment and install the necessary packages with pip. This can be done inside the local repo you've cloned as ``local`` or ``kvenv``, which will be ignored by the repository, or you can create a venv elsewhere in your home directory, i.e. ``~/venvs/build_venv``. If you are using the pipeline version in ``cedaproc`` there should already be a virtual environment set up.

.. code-block:: console
@@ -26,14 +26,13 @@ Step 1 is to create a virtual environment and install the necessary packages wit
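As a minimal sketch of that step (the ``requirements.txt`` filename and the ``build_venv`` name are assumptions; check the repository root for the actual dependency list):

```shell
# Create and activate a virtual environment inside the cloned repo
python -m venv build_venv
source build_venv/bin/activate
# Install the pipeline dependencies (filename assumed)
pip install -r requirements.txt
```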
Step 2: Environment configuration
---------------------------------
- Create a config file to set necessary environment variables. (Suggested to place these in a local `templates/` folder as this will be ignored by git). Eg:
+ Create a config file to set necessary environment variables. (Suggested to place these in the local ``config/`` folder as this will be ignored by git). E.g.:

.. code-block:: console

    - export WORKDIR =/gws/nopw/j04/cmip6_prep_vol1/kerchunk-pipeline;
    - export GROUPDIR =/gws/nopw/j04/cmip6_prep_vol1/kerchunk-pipeline/groups/CMIP6_rel1_6233;
    - export SRCDIR =/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder;
    - export KVENV =/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder/build_venv;
    + export WORKDIR=/path/to/kerchunk-pipeline
    + export SRCDIR=/gws/nopw/j04/cedaproc/kerchunk_builder/kerchunk-builder
    + export KVENV=$SRCDIR/kvenv

Now you should be set up to run the pipeline properly. For any of the pipeline scripts, running ``python <script>.py -h`` (or ``--help``) will bring up a list of options to use for that script as well as the required parameters.
@@ -49,7 +48,6 @@ In order to successfully run the pipeline you need the following input files:

It is also helpful to create a setup/config bash script to set all your environment variables which include:
- WORKDIR: The working directory for the pipeline (where to store all the cache files)
- - GROUPDIR: Subdirectory under the working directory for the particular group you are running. (This is not required but could make things easier)
- SRCDIR: Path to the kerchunk-builder repo where it has been cloned.
- KVENV: Path to a virtual environment for the pipeline.
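A config script along these lines would cover the list above (all paths here are placeholders, not real locations):

```shell
#!/bin/bash
# Hypothetical setup script: source this before running the pipeline
export WORKDIR=/path/to/kerchunk-pipeline   # working directory / cache files
export SRCDIR=/path/to/kerchunk-builder     # cloned kerchunk-builder repo
export KVENV=$SRCDIR/build_venv             # pipeline virtual environment
source $KVENV/bin/activate
```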

@@ -69,7 +67,7 @@ Some useful option/flags to add:
-Q # Quality
# - thorough run - use to ignore cache files and perform checks on all netcdf files
-r # repeat_id
- # - default uses main (1), if you have created repeat_ids manually or with assess.py, specify here [omit "proj_codes_"]
+ # - default uses main, if you have created repeat_ids manually or with assess.py, specify here.
-d # dryrun
# - Skip creating any new files in this phase
