CCI Water Vapour Example
========================

The CCI water vapour input CSV file can be found in ``extensions/example_water_vapour/`` within this repository. This guide will take you through running the pipeline for this example set of 4 datasets.
Assuming you have already gone through the setup instructions in *Getting Started*, you can now proceed with creating a group for this test dataset.

A new *group* is created within the pipeline using the ``init`` operation as follows:

::

    python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::
    Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the ``-v`` *verbose* flag to indicate we want to see the ``[INFO]`` messages emitted by the pipeline. Adding a second ``v`` (``-vv``) would also show ``[DEBUG]`` messages.
    Also, the ``init`` phase is always run as a serial process, since it just involves creating the directories and config files required by the pipeline.
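
For example, to see the debug output as well, the same command could be run with the doubled verbose flag (illustrative only):

.. code-block:: console

    python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -vv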

The output of the above command should look something like this:

.. code-block:: console

    INFO [main-group]: Running init steps as serial process
    INFO [init]: Starting initialisation
    INFO [init]: Copying input file from relative path - resolved to /home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
    INFO [init]: Creating project directories
    INFO [init]: Creating directories/filelists for 1/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 2/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 3/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Creating directories/filelists for 4/4
    INFO [init]: Updated new status: init - complete
    INFO [init]: Created 24 files, 8 directories in group my_new_group
    INFO [init]: Written as group ID: my_new_group

Ok great, we've initialised the pipeline for our new group! Here's a summary diagram of the directories and files that were just created:

::

    WORKDIR
      - groups
        - my_new_group
          - proj_codes
            - main.txt
          - blacklist_codes.txt
          - datasets.csv # (a copy of the input file)

      - in_progress
        - my_new_group
          - code_1 # (codes 1 to 4 in this example)
            - allfiles.txt
            - base-cfg.json
            - phase_logs
              - scan.log
              - compute.log
              - validate.log
            - status_log.csv
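
As a quick sanity check, you can inspect some of the files just created. This is a sketch, assuming your ``WORKDIR`` environment variable is set as described in *Getting Started*:

.. code-block:: console

    # List the project codes registered for the group
    cat $WORKDIR/groups/my_new_group/proj_codes/main.txt
    # Inspect the per-dataset working directories
    ls $WORKDIR/in_progress/my_new_group/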

For peace of mind, and to check your understanding of the pipeline assessor tool, we suggest running this command next:

::

    python assess.py progress my_new_group

Your output should then look something like this:

.. code-block:: console

    Group: my_new_group
    Total Codes: 4

    Pipeline Current:
    init : 4 [100.%] (Variety: 1)
        - complete : 4

    Pipeline Complete:
    complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully, and no datasets have completed the pipeline yet.

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets, which would complete the pipeline.

.. note::
    For each of the above phases, jobs will be submitted to SLURM when using the ``group_run`` script. Please make sure to wait until all jobs are complete for one phase *before* running the next one.
    After each job, check the progress of the pipeline with the same command as before to check that all the datasets ``complete`` as expected. See below on what to do if datasets encounter errors.

.. code-block:: console

    python group_run.py scan my_new_group
    python group_run.py compute my_new_group
    python group_run.py validate my_new_group
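
To confirm that all SLURM jobs for the current phase have finished before launching the next one, you can query the queue. This is a sketch assuming a standard SLURM setup:

.. code-block:: console

    # Show your currently queued and running jobs; an empty list means the phase is done
    squeue -u $USER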

A more complex example of the errors you might encounter while running the pipeline is shown below:

.. code-block:: console

    Group: cci_group_v1
    Total Codes: 361

    Pipeline Current:
    compute : 21 [5.8 %] (Variety: 2)
        - complete : 20
        - KeyError 'refs' : 1

    Pipeline Complete:
    complete : 185 [51.2%]
    blacklist : 155 [42.9%] (Variety: 8)
        - NonKerchunkable : 50
        - PartialDriver : 3
        - PartialDriverFail : 5
        - ExhaustedMemoryLimit : 56
        - ExhaustedTimeLimit : 18
        - ExhaustedTimeLimit* : 1
        - ValidationMemoryLimit : 21
        - ScipyDimIssue : 1

In this example, 185 of the datasets in the ``cci_group_v1`` group have completed the pipeline, while 155 have been excluded (see Blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset, we can use the command below:

.. code-block:: console

    python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E

This will match our ``compute``-phase error with that message, and the ``-E`` flag will give us the whole error log from that run. This may be enough to assess and fix the issue; otherwise, a command to rerun just this dataset will be suggested by the assessor:

.. code-block:: console

    Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
    Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d

This rerun command includes several flags; the most important here is the ``-G`` group flag, since we now need to use the ``single_run`` script and so must specify the group. The ``-d`` dryrun flag means we will not produce any output files, since we may need to test and rerun several times.
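
Once the issue is fixed, the same command can be run without the ``-d`` flag so that output files are actually produced (a sketch based on the suggested command above):

.. code-block:: console

    python single_run.py compute 218 -G cci_group_v1 -vv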

Getting Started
===============

.. note::
    Ensure you have local modules enabled such that you have Python 3.x installed in your local environment. A version of the pipeline source code exists at ``/gws/nopw/j04/cedaproc/kerchunk_builder``, so please see if this can be used before cloning the repository elsewhere.

Step 0: Git clone the repository
--------------------------------

If you need to clone the repository, either clone the main branch (no branch specified) or check for the latest version at github.com/cedadev/kerchunk-builder, which you can clone using:

::

    git clone git@github.com:cedadev/kerchunk-builder.git --branch v1.0.1

Step 1: Set up Virtual Environment
----------------------------------

Step 1 is to create a virtual environment and install the necessary packages with pip. This can be done inside the local repo you've cloned, as ``local`` or ``kvenv`` (both of which are ignored by the repository), or you can create a venv elsewhere in your home directory, e.g. ``~/venvs/build_venv``. If you are using the pipeline version in ``cedaproc``, there should already be a virtual environment set up.
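
The exact commands are environment-specific, but a minimal sketch of this step (assuming a standard venv workflow and a ``requirements.txt`` in the repo root) looks like:

.. code-block:: console

    python -m venv kvenv
    source kvenv/bin/activate
    pip install -r requirements.txt   # assumed dependency file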

Step 2: Environment configuration
---------------------------------

Create a config file to set the necessary environment variables. (We suggest placing this in the local ``config/`` folder, as this is ignored by git.) E.g.:

.. code-block:: console

    export WORKDIR=/path/to/kerchunk-pipeline
    export SRCDIR=/gws/nopw/j04/cedaproc/kerchunk_builder/kerchunk-builder
    export KVENV=$SRCDIR/kvenv
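
Assuming you saved these exports to, say, ``config/setup-env.sh`` (a hypothetical name), you can load them into your shell with:

.. code-block:: console

    source config/setup-env.sh
    echo $WORKDIR   # confirm the variables are set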

Now you should be set up to run the pipeline properly. For any of the pipeline scripts, running ``python <script>.py -h`` (or ``--help``) will bring up a list of options to use for that script, as well as the required parameters.

It is also helpful to create a setup/config bash script to set all your environment variables, which include:

- WORKDIR: The working directory for the pipeline (where to store all the cache files)
- SRCDIR: Path to the kerchunk-builder repo where it has been cloned.
- KVENV: Path to a virtual environment for the pipeline.

Some useful options/flags to add:

::

    -Q # Quality
       # - thorough run - use to ignore cache files and perform checks on all netcdf files
    -r # repeat_id
       # - default uses main; if you have created repeat_ids manually or with assess.py, specify here.
    -d # dryrun
       # - Skip creating any new files in this phase