
Commit

Merge pull request #36 from gjbex/development
Slurmify documentation
gjbex authored May 24, 2024
2 parents 788618d + 16a1bac commit 429490f
Showing 15 changed files with 128 additions and 36 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -29,11 +29,13 @@ common tasks with minimal modification of job scripts.


## Documentation

Full documentation is available on
[Read the Docs](http://atools.readthedocs.io/en/latest/).


## Important note

If you use job arrays on an HPC system that accounts for compute time,
remember that each job in the array is accounted as an individual job.
Depending on the number of cores used by a job, this may increase the
@@ -42,6 +44,7 @@ cost by a large factor compared to using the


## Requirements

`atools` requires at least Python 3.2, but only uses the standard
library.
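
As a quick sanity check (purely illustrative), you can verify the Python
version available on the cluster with:

```bash
$ python3 --version   # any Python 3.2 or later will do
```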

@@ -51,17 +54,21 @@ that can be changed during installation.


## Installing

After downloading and unpacking, simply run `configure` and `make`. For
details, see the [documentation](http://atools.readthedocs.io/en/latest/).
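
For reference, a typical sequence might look like the sketch below (the version
number and install prefix are assumptions; see the documentation for the
authoritative steps):

```bash
$ tar xzf atools-1.5.2.tar.gz                    # assuming a release tarball
$ cd atools-1.5.2
$ ./configure --prefix="$HOME/software/atools"   # prefix is just an example
$ make
$ make install
```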


## Planned features

In no particular order...

* Template-based job script creation.
* Indexed data files for scaling to very large numbers of tasks.


## Contributors

* [Geert Jan Bex](mailto:[email protected]), Hasselt University/University
of Leuven
* Stefan Becuwe, Antwerp University
2 changes: 1 addition & 1 deletion configure.ac
@@ -1,4 +1,4 @@
AC_INIT([atools], [1.5.1], [[email protected]])
AC_INIT([atools], [1.5.2], [[email protected]])
AM_INIT_AUTOMAKE([-Wall -Werror foreign tar-pax])

AC_CONFIG_FILES([
14 changes: 8 additions & 6 deletions docs/acreate.md
@@ -1,28 +1,30 @@
# Adding atools features using templates

Although the modifications required to use `atools` are fairly simple,
they involve some steps that my be unfamiliar to the casual user.
they involve some steps that may be unfamiliar to the casual user.

`acreate` adds everything required to use `atools` effectively to an
existing job script. By default, it will insert the `PATH` redefinition
to use the `atools` commands, and the logging of start and end events.
Suppose the original log file is called `bootstrap.pbs`, then the command
Suppose the original job script is called `jobscript.slurm`, then the command
to generate the file annotated for `atools` is:

```bash
$ acreate bootstrap.pbs > bootstrap_atools.pbs
$ acreate jobscript.slurm > jobscript_atools.slurm
```

If `aenv` is to be used, in addition to logging, you simply add the
`--data` option:
```bash
$ acreate --data data.csv -- bootstrap.pbs > bootstrap_atools.pbs
$ acreate --data data.csv -- jobscript.slurm > jobscript_atools.slurm
```

The default shell is the one specified in the configuration file, but
this can be overridden on the command line using the `--shell` option,
e.g., if `bootstrap.pbs` where a tcsh shell script, you would use
e.g., if `jobscript.slurm` were a tcsh shell script, you would use

```bash
$ acreate --shell tcsh bootstrap.pbs > bootstrap_atools.pbs
$ acreate --shell tcsh jobscript.slurm > jobscript_atools.slurm
```

It is also possible to supply your own template instead of the one provided
11 changes: 9 additions & 2 deletions docs/aenv.md
@@ -1,4 +1,5 @@
# Getting your parameters: `aenv`

The parameters for tasks can be stored in a CSV file, where the first
row is simply the names of the parameters, and each consecutive row
represents the values of these parameters for a specific experiment, i.e.,
@@ -7,19 +8,22 @@ computational task.
`aenv` will use the task identifier as an index into this CSV file, and
define environment variables with the appropriate values for that task.
As an example, consider the following job script:

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
alpha=0.5
beta=-1.3
Rscript bootstrap.R $alpha $beta
...
```

However, this approach would lead to as many job scripts as there are
parameter instances, which is inconvenient to say the least.

This computation would have to be done for many values of `alpha` and
`beta`. These values can be represented in a CSV file, `data.csv`:

```
alpha,beta
0.5,-1.3
@@ -28,15 +32,18 @@ alpha,beta
0.6,-1.3
...
```

The job script can be modified to automatically define the appropriate
values for `alpha` and `beta` specific to the task.

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
source <(aenv --data data.csv)
Rscript bootstrap.R $alpha $beta
...
```

`aenv` will use the value of the task identifier to read the corresponding
row in the `data.csv` CSV file, and export the variables `alpha` and `beta`
with those values.
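
For the first task, for instance, sourcing the `aenv` output has the same net
effect as the following assignments (shown purely as an illustration; the exact
text emitted by `aenv` may differ):

```bash
# net effect for task identifier 1, given the data.csv above (illustrative only)
export alpha='0.5'
export beta='-1.3'
```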
9 changes: 7 additions & 2 deletions docs/aload.md
@@ -1,4 +1,5 @@
# Detailed job statistics

Gathering statistics about the execution time of tasks is straightforward
using `aload`. Given the log file(s) of a job, it will

@@ -12,9 +13,11 @@ or to report on. The second statistic may be helpful to estimate load
imbalance, and improve resource requests for future jobs.

Using `aload` is simple:

```bash
$ aload --log bootstrap.pbs.log10493
$ aload --log jobscript.slurm.log10493
```

It is not always useful to include failed items in the statistics since
their execution time may seriously skew the results. They can be excluded
by adding the `--no_failed` flag to the call to `aload`.
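
For example, reusing the log file name from above:

```bash
$ aload --log jobscript.slurm.log10493 --no_failed
```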
@@ -23,9 +26,11 @@ Sometimes it can be useful to compute more detailed statistics or plot
distributions of, e.g., the task execution time. It is beyond the scope
of `aload` to do this, but the data can be exported for further analysis
by adding the `--list_tasks` flag, i.e.,

```bash
$ aload --log bootstrap.pbs.log10493 --list_tasks
$ aload --log jobscript.slurm.log10493 --list_tasks
```

Similarly, for raw data on the slaves, add the `--list_slaves` flag.
If the output is to be imported into a software package, or parsed by a
script, it can be more convenient to obtain it in CSV format by adding the
12 changes: 9 additions & 3 deletions docs/alog.md
@@ -1,4 +1,5 @@
# Logging for fun and profit

Often, it is useful to log information about the execution of individual
tasks. This information can be used

@@ -16,29 +17,33 @@ centralized logging in a single file. This requires a POSIX compliant
shared file system when the job is running on multiple compute nodes.

Again, consider the fragment of the job script:

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
alpha=0.5
beta=-1.3
Rscript bootstrap.R $alpha $beta
...
```

(Note that `aenv` was not used here, which was done to stress the point
that `alog` and `aenv` are independent of one another, but obviously can
be combined.)

To enable logging, a call to `alog` is added as the first and the last
executable line of the job script, i.e.,

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
alog --state start
alpha=0.5
beta=-1.3
Rscript bootstrap.R $alpha $beta
alog --state end --exit $?
```

Here we assume that the exit status of the last actual job command
(`Rscript` in this example) is also the exit status of the task. The
Linux convention is that exit code 0 signifies success, any value between
@@ -52,14 +57,15 @@ The resulting log file is automatically created, and its name will be
the conventions of the queue system or scheduler used.

The log file will look like, e.g.,
```

```
1 started by r1i1n3 at 2016-09-02 11:47:45
2 started by r1i1n3 at 2016-09-02 11:47:45
3 started by r1i1n3 at 2016-09-02 11:47:46
2 failed by r1i1n3 at 2016-09-02 11:47:46: 1
3 completed by r1i1n3 at 2016-09-02 11:47:47
```

The format is `<task-id> <status> by <node-name> at <time-stamp>`, followed
by `: <exit-status>` for failed jobs. For this particular example, task
1 didn't complete, 2 failed, and 3 completed successfully.
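
For reference, a job script that combines `aenv` and `alog` might look like the
sketch below, assembled from the fragments above:

```bash
#!/usr/bin/env -S bash -l
...
alog --state start
source <(aenv --data data.csv)
Rscript bootstrap.R $alpha $beta
alog --state end --exit $?
```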
13 changes: 8 additions & 5 deletions docs/arange.md
@@ -1,4 +1,5 @@
# Monitoring jobs and resuming tasks

Keeping track of the tasks already completed, successfully or not, or tasks
still pending can be somewhat annoying. Resuming tasks that were not
completed, or that failed requires a level of bookkeeping you may prefer
@@ -8,14 +9,16 @@ Note that for this to work, your job should do logging using
[`alog`](alog.md).

## Monitoring a running job

Given either the CSV file or the task identifier range for a job, and its
log file as generated by `alog`, `arange` will provide statistics on the
progress of a running job, or a summary of a completed job.

If the log file's name is `bootstrap.pbs.log10493`, and the job was based
If the log file's name is `jobscript.slurm.log10493`, and the job was based
on a CSV data file `data.csv`, a summary can be obtained by

```bash
$ arange --data data.csv --log bootstrap.pbs.log10493 --summary
$ arange --data data.csv --log jobscript.slurm.log10493 --summary
```
In case a job has been resumed, you should list all log files relevant to
the job to get correct results.
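
For instance, if a resumed job produced a second log file, the call might look
like the sketch below (this assumes multiple log files can simply be listed
after `--log`; the second file name is hypothetical):

```bash
$ arange --data data.csv --log jobscript.slurm.log10493 jobscript.slurm.log10517 \
    --summary
```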
@@ -32,7 +35,7 @@ It can be switched off using the `--no_sniffer` option.
Of course, `arange` works independently of `aenv`, so it also supports
keeping track of general job arrays using the `-t` flag.
```bash
$ arange -t 1-250 --log bootstrap.pbs.log10493 --summary
$ arange -t 1-250 --log jobscript.slurm.log10493 --summary
```

Sometimes it is useful to explicitly list the task identifiers of either
@@ -45,11 +48,11 @@ identifiers should be redone when an array job did not complete, or when
some of its tasks failed. To get an identifier range of tasks that were
not completed, use
```bash
$ arange --data data.csv --log bootstrap.pbs.log10493`
$ arange --data data.csv --log jobscript.slurm.log10493
```
or, when not using `aenv`
```bash
$ arange -t 1-250 --log bootstrap.pbs.log10493`
$ arange -t 1-250 --log jobscript.slurm.log10493
```

If you want to include the tasks that failed, for instance when a bug that
22 changes: 17 additions & 5 deletions docs/areduce.md
@@ -11,12 +11,20 @@ file, not replicated throughout the aggregated file. More complicated
aggregations, e.g., into an R dataframe, require some programming.

Suppose that the output of each task is stored in a file with name
`out-{PBS_ARRAYID}.txt` where `PBS_ARRAYID` represents the array ID of
the respective task, and the final output should be a file `out.txt` that
is the concatenation of all the individual files.
`out-{PBS_ARRAYID}.txt` or `out-{SLURM_ARRAY_TASK_ID}.txt` where `PBS_ARRAYID`
or `SLURM_ARRAY_TASK_ID` represents the array ID of the respective task for PBS
Torque or Slurm respectively, and the final output should be a file `out.txt`
that is the concatenation of all the individual files.

```bash
$ areduce -t 1-250 --pattern 'out-{PBS_ARRAYID}.txt' --out out.txt
```

Similarly, for Slurm:
```bash
$ areduce -t 1-250 --pattern 'out-{SLURM_ARRAY_TASK_ID}.txt' --out out.txt
```

Although this could be easily achieved with `cat`, there are nevertheless
advantages to using `areduce` even in this very simple case. `areduce`
handles missing files (failed tasks) gracefully, whereas the corresponding
@@ -26,12 +34,14 @@ proper order of the files, while this would be cumbersome to do by hand.
If each of the output files were a CSV file, the first line of each file
would contain the field names, which should occur only once in the
aggregated file, as its first line.

```bash
$ areduce -t 1-250 --pattern 'out-{t}.csv' --out out.csv --mode csv
```

The command above will produce the desired CSV file without any hassle.
Note that the shorthand `t` for `PBS_ARRAYID` has been used in the file
name pattern specification.
Note that the shorthand `t` for `PBS_ARRAYID` or `SLURM_ARRAY_TASK_ID`
has been used in the file name pattern specification.

When one or more tasks failed, you may not want to aggregate the output of
those tasks since it may be incomplete and/or incorrect. In that case,
@@ -51,6 +61,7 @@ specified via the `--mode` option.

For example, the following command would aggregate data skipping three lines
at the top, and five lines at the bottom of each individual output file:

```bash
$ areduce -t 1-250 --pattern 'out-{t}.txt' --out out.txt \
--mode body --reduce_args '--h 3 -f 5'
@@ -67,6 +78,7 @@ Examples can be found in the `reduce` directory.

Arguments can be passed to the `empty` and `reduce` scripts as in the example
below:

```bash
$ areduce -t 1-250 --pattern 'out-{t}.csv' --out out.csv \
--empty my_empty --reduce my_reduce --reduce_args '--h 3'
35 changes: 27 additions & 8 deletions docs/job_arrays.md
@@ -1,17 +1,28 @@
# What are job arrays?

A resource manager or scheduler that supports job arrays typically
exposes a task identifier to the job script as an environment variable.
This is simply a number out of a range specified when the job is submitted.

For the resource managers and schedulers supported by `atools`, that would
be

* `PBS_ARRAYID` for PBS torque,
* `MOAB_JOBARRAYINDEX` for Adaptive's Moab, and
* `SGE_TASKID` for SUN Grid Engine (SGE),
* `MOAB_JOBARRAYINDEX` for Adaptive's Moab,
* `SGE_TASKID` for SUN Grid Engine (SGE), and
* `SLURM_ARRAY_TASK_ID` for Slurm workload manager.

Typically, this task identifier is then used to determine, e.g., the
specific input file for this task in the job script:
specific input file for this task in the Slurm job script:

```bash
...
INPUT_FILE="input-${SLURM_ARRAY_TASK_ID}.csv"
...
```

Similarly, for a PBS Torque job script:

```bash
...
INPUT_FILE="input-${PBS_ARRAYID}.csv"
@@ -20,14 +31,22 @@ INPUT_FILE="input-${PBS_ARRAYID}.csv"

Submitting array jobs is quite simple. For each of the supported queue
systems and schedulers, one simply adds the `-t <int-range>` option to
the submission command, `qsub` for PBS torque and SUN grid engine, `msub`
for Moab, e.g., for PBS torque:
the submission command (`qsub` for PBS torque and SUN grid engine, `msub`
for Moab), or `--array=<int-range>` to `sbatch` for Slurm, e.g., for Slurm:

```bash
$ qsub -t 1-250 bootstrap.pbs
$ sbatch --array=1-250 jobscript.slurm
```

Similarly, for PBS torque:

```bash
$ qsub -t 1-250 jobscript.pbs
```

The submission command above would create a job array of 250 tasks, and
for each the `PBS_ARRAYID` environment variable would be assigned a unique
value between 1 and 250, inclusive.
for each task, the `SLURM_ARRAY_TASK_ID` or `PBS_ARRAYID` environment variable
would be assigned a unique value between 1 and 250, inclusive.

Although job arrays provide sufficient features for simple scenarios, they
quickly become a nuisance for more sophisticated problems, especially in
2 changes: 1 addition & 1 deletion examples/cleanup.sh
@@ -1,3 +1,3 @@
#!/bin/bash
#!/usr/bin/env -S bash -l

rm -f *.pbs.* out-*.txt