Version 1.1 Major changes to Error Logging #12

Merged: 49 commits, Mar 8, 2024
Changes from all commits (49 commits)
51613b2
Updated inspection method
dwest77a Feb 20, 2024
cec7630
Aded backtrack and xkshape warning
dwest77a Feb 20, 2024
ab5772d
Added backtrack
dwest77a Feb 20, 2024
934f093
Added XKShape tolerance and get_concat_dims for multiple validation o…
dwest77a Feb 20, 2024
df6fddb
Updated version
dwest77a Feb 20, 2024
d6edc55
Updated setup and added cci env variables
dwest77a Feb 20, 2024
36cad7e
Changed variable difference message between files
dwest77a Feb 20, 2024
1d7f8ac
Merge branch 'dev' into dev_update
dwest77a Feb 20, 2024
91f2ccb
Merge pull request #10 from cedadev/dev_update
dwest77a Feb 20, 2024
b8de738
Integration of blacklist into progress tracker
dwest77a Feb 21, 2024
c1a6ae6
Added dryrun carrythrough
dwest77a Feb 21, 2024
5b18988
Overhauled dimension handling and added extra functions, scanning now…
dwest77a Feb 21, 2024
1688b01
Added functionality for dimension highlighting and concatenation test…
dwest77a Feb 21, 2024
66ef1d4
Added multiple nan-type checks
dwest77a Feb 21, 2024
22a74d7
Added concat fatal error for l2 data
dwest77a Feb 21, 2024
aa96b57
Integration with other pipeline functions
dwest77a Feb 21, 2024
1abd97c
Added blacklist/virtual support, reading log and status files
dwest77a Mar 6, 2024
b964318
Minor changes experimenting with streamline/dependency - abandoned
dwest77a Mar 6, 2024
d8790b2
Removed error and output log options per job
dwest77a Mar 6, 2024
6334ec4
Set up a dedicated place for config files
dwest77a Mar 6, 2024
f5ea0df
Organised multi-purpose functions to a single script within the pipeline
dwest77a Mar 6, 2024
84139a8
Initialised allocator script
dwest77a Mar 6, 2024
7fc1387
Moved scripts
dwest77a Mar 6, 2024
c4bbc23
Added requirements that were misplaced
dwest77a Mar 6, 2024
512b26a
Added new features; loading and creating refs as required, saving ref…
dwest77a Mar 6, 2024
1d66435
Added file handling and status logging
dwest77a Mar 6, 2024
2f3b154
Updated import locations
dwest77a Mar 6, 2024
89e7f37
Removed scanfile skip option
dwest77a Mar 6, 2024
ab9890d
Added file handlers and status logging, also pass/fail counters
dwest77a Mar 6, 2024
6355ed8
Added test/example notebooks
dwest77a Mar 6, 2024
32f7c3f
Updated with comments and rearranged sections
dwest77a Mar 6, 2024
7b71b88
Merge pull request #11 from cedadev/cci_cases
dwest77a Mar 6, 2024
834e867
Removed old examples
dwest77a Mar 8, 2024
b004939
Removed unnecessary or outdated files
dwest77a Mar 8, 2024
2a9669b
Updated for move to cedaproc
dwest77a Mar 8, 2024
905291e
Added template for cedaproc use
dwest77a Mar 8, 2024
775a640
Removed testing notebooks and updated showcase tools
dwest77a Mar 8, 2024
5aefe8e
Updated assessor documentation
dwest77a Mar 8, 2024
42335f2
updated gitignore
dwest77a Mar 8, 2024
2721234
Added example input file for tutorial and startup:
dwest77a Mar 8, 2024
8d4fc67
Added introductory powerpoints
dwest77a Mar 8, 2024
e4e7dac
Updated docs with example
dwest77a Mar 8, 2024
ff9d068
Added template setup file
dwest77a Mar 8, 2024
43181e1
Fixed syntax issue
dwest77a Mar 8, 2024
8e9f3ec
Renamed file
dwest77a Mar 8, 2024
85ea009
Removed deploy function for dependent jobs
dwest77a Mar 8, 2024
363d953
Updated utils, added filehandler continuation support for all pipelin…
dwest77a Mar 8, 2024
c03aeb3
Added job log system support and traceback within error logs
dwest77a Mar 8, 2024
affa3e4
Merge branch 'main' into dev
dwest77a Mar 8, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,8 +1,10 @@
.ipynb_checkpoints
build_venv/
kvenv/
temp/
testing/
*__pycache__*
.vscode/
docs/build/
build/
pipeline.egg-info
810 changes: 488 additions & 322 deletions assess.py

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions config/setup-cci.sh
@@ -0,0 +1,7 @@
module load jaspy

export WORKDIR=/gws/nopw/j04/esacci_portal/kerchunk_conversion/
export SRCDIR=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
export KVENV=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder/build_venv

source $KVENV/bin/activate
@@ -1,4 +1,8 @@
module load jaspy

export WORKDIR=/gws/nopw/j04/cmip6_prep_vol1/kerchunk-pipeline
export GROUPDIR=/gws/nopw/j04/cmip6_prep_vol1/kerchunk-pipeline/groups/CMIP6_rel1_6233
export SRCDIR=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
export KVENV=/home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder/build_venv

source $KVENV/bin/activate
File renamed without changes.
7 changes: 7 additions & 0 deletions config/setup-project.sh.template
@@ -0,0 +1,7 @@
module load jaspy

export WORKDIR=/path/to/kerchunk-pipeline
export SRCDIR=/gws/nopw/j04/cedaproc/kerchunk_builder/kerchunk-builder
export KVENV=$SRCDIR/kvenv

source $KVENV/bin/activate
66 changes: 55 additions & 11 deletions docs/source/assess-overview.rst
@@ -4,45 +4,89 @@ Assessor Tool
The assessor script ``assess.py`` is an all-purpose pipeline checking tool which can be used to assess:
- The current status of all datasets within a given group in the pipeline (which phase each dataset currently sits in)
- The errors/outputs associated with previous job runs.
- Specific logs from datasets which are presenting a specific type of error.

An example command to run the assessor tool can be found below:
::

    python assess.py <operation> <group>

Where the operation can be one of the options below:
- ``progress``: Get a general overview of the pipeline; how many datasets have completed or are stuck on each phase.
- ``summarise``: Get an assessment of the data processed for this group.
- ``display``: Display a specific type of information about the pipeline (blacklisted codes, datasets with virtual dimensions or using parquet).
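
For illustration, the command surface above can be sketched with argparse. This is a hypothetical sketch: the flag names are taken from the examples in this document, and the real ``assess.py`` defines many more options.

```python
import argparse

def build_parser():
    # Sketch of the assessor CLI: a positional operation and group,
    # plus a few of the flags used in the examples in this document.
    p = argparse.ArgumentParser(prog="assess.py")
    p.add_argument("operation", choices=["progress", "summarise", "display"])
    p.add_argument("group")
    p.add_argument("-p", "--phase", help="restrict output to one pipeline phase")
    p.add_argument("-r", "--repeat-id", help="label of a previous repeat run")
    p.add_argument("-e", "--error", help="filter datasets by error message")
    p.add_argument("-W", "--write", action="store_true",
                   help="actually write output files instead of dry-running")
    return p

args = build_parser().parse_args(["progress", "cci_group_v1", "-p", "scan"])
print(args.operation, args.group, args.phase)  # progress cci_group_v1 scan
```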

1. Overall Progress of the Pipeline
-----------------------------------

To see the general status of the pipeline for a given group:
::

python assess.py <group> progress
python assess.py progress <group>

An example output from this command can be seen below:
::

Group: cci_group_v1
Total Codes: 361

scan : 1 [0.3 %] (Variety: 1)
- Complete : 1

complete : 185 [51.2%] (Variety: 1)
- complete : 185

unknown : 21 [5.8 %] (Variety: 1)
- no data : 21

blacklist : 162 [44.9%] (Variety: 7)
- NonKerchunkable : 50
- PartialDriver : 3
- PartialDriverFail : 5
- ExhaustedMemoryLimit : 64
- ExhaustedTimeLimit : 18
- ExhaustedTimeLimit* : 1
- ValidationMemoryLimit : 21

In this case, 185 datasets have completed the pipeline, with 1 left to be scanned. The 21 unknowns have no log file, so there is no information on these. This will be resolved in later versions, where a ``seek`` function will run automatically when checking progress to fix gaps in the logs for missing datasets.
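
The percentage breakdown shown above can be derived from per-dataset statuses with a simple counter. A minimal sketch using hypothetical data (not the pipeline's actual implementation):

```python
from collections import Counter

# Hypothetical per-dataset statuses mirroring the summary format above.
statuses = {
    "ds001": "complete",
    "ds002": "complete",
    "ds003": "scan",
    "ds004": "unknown",
    "ds005": "blacklist",
}

counts = Counter(statuses.values())
total = len(statuses)
for phase, n in counts.most_common():
    # e.g. "complete  : 2    [40.0%]"
    print(f"{phase:<10}: {n:<4} [{100 * n / total:.1f}%]")
```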


An example use case is to write out all datasets that require scanning to a new label (repeat_label):
::

python assess.py <group> progress -p scan -r <label_for_scan_subgroup> -W
python assess.py progress <group> -p scan -r <label_for_scan_subgroup> -W


The last flag ``-W`` is required when writing an output file from this program; otherwise the program will dry-run and produce no files.

2. Checking errors
------------------
Check what repeat labels are available already using
Check what repeat labels are available already using:
::

python assess.py <group> errors -s labels

python assess.py display <group> -s labels

Show what jobs have previously run
To list the status of all datasets from a previous repeat ID:
::

python assess.py <group> errors -s jobids
python assess.py progress <group> -r <repeat_id>


For showing all errors from a previous job run
For selecting a specific type of error (-e) and examining the full log for each example (-E):
::

python assess.py <group> errors -j <jobid>
python assess.py progress <group> -r <old_id> -e "type_of_error" -p scan -E

Following from this, you may want to rerun the pipeline for just one type of error previously found:
::

python assess.py progress <group> -r <old_repeat_id> -e "type_of_error" -p scan -n <new_repeat_id>

Note that if you are looking at a specific repeat ID, you can forgo the phase (-p) flag, since this set is expected to appear in the same phase anyway.
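
Conceptually, the error filter described above selects the project codes whose logged error matches a given type, optionally restricted to a phase. A sketch with hypothetical records (not real pipeline output):

```python
# Each record: (project code, phase, last logged error).
records = [
    ("code_1", "scan", "KeyError 'refs'"),
    ("code_2", "scan", "ExhaustedMemoryLimit"),
    ("code_3", "compute", "KeyError 'refs'"),
]

def select(records, error, phase=None):
    # Keep codes whose error matches; phase=None means any phase,
    # mirroring how the -p flag can be omitted.
    return [code for code, ph, err in records
            if err == error and (phase is None or ph == phase)]

print(select(records, "KeyError 'refs'"))                # ['code_1', 'code_3']
print(select(records, "KeyError 'refs'", phase="scan"))  # ['code_1']
```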

3. Special Display options
--------------------------

For selecting a specific type of error to investigate (-i) and examine the full log for each example (-E)
Check how many of the datasets in a group have virtual dimensions
::

python assess.py test errors -j <jobid> -i "type_of_error" -E
python assess.py display <group> -s virtuals
139 changes: 139 additions & 0 deletions docs/source/cci_water.rst
@@ -0,0 +1,139 @@
CCI Water Vapour Example
========================

The CCI water vapour input CSV file can be found in ``extensions/example_water_vapour/`` within this repository. This guide will take you through running the pipeline for this example set of 4 datasets.
Assuming you have already gone through the setup instructions in *Getting Started*, you can now proceed with creating a group for this test dataset.

A new *group* is created within the pipeline using the ``init`` operation as follows:

::

python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::
Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the (-v) *verbose* flag to indicate we want to see the ``[INFO]`` messages put out by the pipeline. Adding a second (v) would also show ``[DEBUG]`` messages.
Also the ``init`` phase is always run as a serial process since it just involves creating the directories and config files required by the pipeline.

The output of the above command should look something like this:

.. code-block:: console

INFO [main-group]: Running init steps as serial process
INFO [init]: Starting initialisation
INFO [init]: Copying input file from relative path - resolved to /home/users/dwest77/Documents/kerchunk_dev/kerchunk-builder
INFO [init]: Creating project directories
INFO [init]: Creating directories/filelists for 1/4
INFO [init]: Updated new status: init - complete
INFO [init]: Creating directories/filelists for 2/4
INFO [init]: Updated new status: init - complete
INFO [init]: Creating directories/filelists for 3/4
INFO [init]: Updated new status: init - complete
INFO [init]: Creating directories/filelists for 4/4
INFO [init]: Updated new status: init - complete
INFO [init]: Created 24 files, 8 directories in group my_new_group
INFO [init]: Written as group ID: my_new_group

Ok great, we've initialised the pipeline for our new group! Here's a summary diagram of what directories and files were just created:

::

WORKDIR
- groups
- my_new_group
- proj_codes
- main.txt
- blacklist_codes.txt
- datasets.csv # (a copy of the input file)

- in_progress
- my_new_group
- code_1 # (codes 1 to 4 in this example)
- allfiles.txt
- base-cfg.json
- phase_logs
- scan.log
- compute.log
- validate.log
- status_log.csv
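
The directory creation shown in the diagram above can be approximated in a few lines of Python. This is a hypothetical sketch using only the names in the diagram; the real ``init`` phase also writes the config files and the CSV copy of the input.

```python
from pathlib import Path
import tempfile

def init_group(workdir: Path, group: str, codes: list[str]) -> None:
    # Group-level files.
    g = workdir / "groups" / group
    (g / "proj_codes").mkdir(parents=True, exist_ok=True)
    (g / "proj_codes" / "main.txt").write_text("\n".join(codes))
    (g / "blacklist_codes.txt").touch()
    # Per-dataset project directories.
    for code in codes:
        proj = workdir / "in_progress" / group / code
        (proj / "phase_logs").mkdir(parents=True, exist_ok=True)
        (proj / "allfiles.txt").touch()
        (proj / "status_log.csv").touch()

workdir = Path(tempfile.mkdtemp())
init_group(workdir, "my_new_group", ["code_1", "code_2"])
print(sorted(p.name for p in (workdir / "groups" / "my_new_group").iterdir()))
# → ['blacklist_codes.txt', 'proj_codes']
```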

For peace of mind, and to check you understand the pipeline assessor tool, we suggest running this command next:

::

python assess.py progress my_new_group

Upon which your output should look something like this:

.. code-block:: console

Group: my_new_group
Total Codes: 4

Pipeline Current:

init : 4 [100.%] (Variety: 1)
- complete : 4

Pipeline Complete:

complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully, no datasets are complete through the pipeline yet.

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.

.. note::
For each of the above phases, jobs will be submitted to SLURM when using the ``group_run`` script. Please make sure to wait until all jobs are complete for one phase *before* running the next job.
After each job, check the progress of the pipeline with the same command as before to check all the datasets ``complete`` as expected. See below on what to do if datasets encounter errors.

.. code-block:: console

python group_run.py scan my_new_group
python group_run.py compute my_new_group
python group_run.py validate my_new_group

A more complex example of the errors you might encounter while running the pipeline can be found below:

.. code-block:: console

Group: cci_group_v1
Total Codes: 361

Pipeline Current:

compute : 21 [5.8 %] (Variety: 2)
- complete : 20
- KeyError 'refs' : 1

Pipeline Complete:

complete : 185 [51.2%]

blacklist : 155 [42.9%] (Variety: 8)
- NonKerchunkable : 50
- PartialDriver : 3
- PartialDriverFail : 5
- ExhaustedMemoryLimit : 56
- ExhaustedTimeLimit : 18
- ExhaustedTimeLimit* : 1
- ValidationMemoryLimit : 21
- ScipyDimIssue : 1

In this example ``cci_group_v1`` group, 185 of the datasets have completed the pipeline, while 155 have been excluded (See blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 of them have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset we can use the command below:

.. code-block:: console

python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E

This matches our ``compute``-phase error with that message, and the (-E) flag gives us the whole error log from that run. This may be enough to assess and fix the issue; otherwise, the assessor will suggest a command to rerun just this dataset:

.. code-block:: console

Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d

This rerun command includes several flags; the most important here is the (-G) group flag, since the ``single_run`` script requires the group to be specified. The (-d) dryrun flag means no output files are produced, which is useful when you may need to test and rerun several times.
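
The dryrun behaviour described above follows a common guard pattern; a sketch, not the pipeline's actual implementation:

```python
def save_refs(path: str, data: bytes, dryrun: bool = False) -> bool:
    # With dryrun set, report what would happen and write nothing,
    # returning False to indicate no file was produced.
    if dryrun:
        print(f"[DRYRUN] would write {len(data)} bytes to {path}")
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True

written = save_refs("refs.json", b"{}", dryrun=True)
print(written)  # False
```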



67 changes: 0 additions & 67 deletions docs/source/examples.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -19,7 +19,7 @@ The pipeline consists of four central phases, with an additional phase for ingestion

Introduction <pipeline-overview>
Getting Started <start>
Worked Examples <examples>
Example CCI Water Vapour <cci_water>
Pipeline Flags/Options <execution>
Assessor Tool Overview <assess-overview>
Error Codes <errors>