diff --git a/README.md b/README.md
index 00534f4..1dd1886 100644
--- a/README.md
+++ b/README.md
@@ -29,11 +29,13 @@ common tasks with minimal modification of job scripts.
 
 ## Documentation
+
 Full documentation is available on
 [Read the Docs](http://atools.readthedocs.io/en/latest/).
 
 ## Important note
+
 If you use job arrays on a HPC system that accounts for compute time,
 remember that each job in the array is account as an individual job.
 Depending on the number of cores used by a job, this may increase the
@@ -42,6 +44,7 @@ cost by a large factor compared to using the
 
 ## Requirements
+
 `atools` requires at least Python 3.2, but only uses the standard
 library.
@@ -51,17 +54,21 @@ that can be change during installation.
 
 ## Installing
+
 After downloading and unpacking, simply run `configure` and `make`.
 For details, see the
 [documentation](http://atools.readthedocs.io/en/latest/).
 
 ## Planned features
+
 In no particular order...
+
 * Template based job script creation
 * Indexed data files for scaling to very large numbers of tasks.
 
 ## Contributors
+
 * [Geert Jan Bex](mailto:geertjan.bex@uhasselt.be), Hasselt
   University/University of Leuven
 * Stefan Becuwe, Antwerp University
diff --git a/configure.ac b/configure.ac
index de7a284..162c96d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,4 @@
-AC_INIT([atools], [1.5.1], [geertjan.bex@uhasselt.be])
+AC_INIT([atools], [1.5.2], [geertjan.bex@uhasselt.be])
 AM_INIT_AUTOMAKE([-Wall -Werror foreign tar-pax])
 
 AC_CONFIG_FILES([
diff --git a/docs/acreate.md b/docs/acreate.md
index 315fab4..ab5bc25 100644
--- a/docs/acreate.md
+++ b/docs/acreate.md
@@ -1,28 +1,30 @@
 # Adding atools features using templates
 
 Although the modifications required to use `atools` are fairly simple,
-they involve some steps that my be unfamiliar to the casual user.
+they involve some steps that may be unfamiliar to the casual user.
 `acreate` adds everything required to use `atools` effectively to an
 existing job script. By default, it will insert the `PATH` redefinition
 to use the `atools` commands, and the logging of start and end events.
 
-Suppose the original log file is called `bootstrap.pbs`, then the command
+Suppose the original job script is called `jobscript.slurm`, then the command
 to generate the file annotated for `atools` is:
+
 ```bash
-$ acreate bootstrap.pbs > bootstrap_atools.pbs
+$ acreate jobscript.slurm > jobscript_atools.slurm
 ```
 
 If `aenv` is to be used, in addition to logging, you simply add the
 `--data` option:
 ```bash
-$ acreate --data data.csv -- bootstrap.pbs > bootstrap_atools.pbs
+$ acreate --data data.csv -- jobscript.slurm > jobscript_atools.slurm
 ```
 
 The default shell is the one specified in the configuration file, but
 this can be overridden on the command line using the `--shell` option,
-e.g., if `bootstrap.pbs` where a tcsh shell script, you would use
+e.g., if `jobscript.slurm` were a tcsh shell script, you would use
+
 ```bash
-$ acreate --shell tcsh bootstrap.pbs > bootstrap_atools.pbs
+$ acreate --shell tcsh jobscript.slurm > jobscript_atools.slurm
 ```
 
 It is also possible to supply your own template instead of the one provided
diff --git a/docs/aenv.md b/docs/aenv.md
index 729b949..1f72e0c 100644
--- a/docs/aenv.md
+++ b/docs/aenv.md
@@ -1,4 +1,5 @@
 # Getting your parameters: `aenv`
+
 The parameters for tasks can be stored in an CSV file, where the first
 row is simply the names of the parameters, and each consecutive row
 represents the values of these parameters for a specific experiment, i.e.,
@@ -5,21 +6,24 @@
 computational task.
 
 `aenv` will use the task identifier as an index into this CSV file, and
 define environment variables with the appropriate values for that task.
 As an example, consider the following PBS script:
+
 ```bash
-#!/bin/bash
+#!/usr/bin/env -S bash -l
 ...
 alpha=0.5
 beta=-1.3
 Rscript bootstrap.R $alpha $beta
 ...
 ```
+
 However, this approach would lead to as many job scripts are there are
 parameter instances, which is inconvenient to say the least. This
 computation would have to be done for many values for `alpha` and
 `beta`. These values can be represented in an CSV file, `data.csv`:
+
 ```
 alpha,beta
 0.5,-1.3
@@ -28,15 +32,18 @@ alpha,beta
 0.6,-1.3
 ...
 ```
+
 The job script can be modified to automatically define the appropriate
 values for `alpha` and `beta` specific for the task.
+
 ```bash
-#!/bin/bash
+#!/usr/bin/env -S bash -l
 ...
 source <(aenv --data data.csv)
 Rscript bootstrap.R $alpha $beta
 ...
 ```
+
 `aenv` will use the value of the task identifier to read the
 corresponding row in the `data.csv` CSV file, and export the variables
 `alpha` and `beta` with those values.
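+
+For illustration, assuming the first data row of `data.csv` corresponds to
+task identifier 1, sourcing the `aenv` output for task 1 would effectively
+amount to:
+
+```bash
+export alpha=0.5
+export beta=-1.3
+```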
diff --git a/docs/aload.md b/docs/aload.md
index 3bc0ff3..72d4922 100644
--- a/docs/aload.md
+++ b/docs/aload.md
@@ -1,4 +1,5 @@
 # Detailed job statistics
+
 Gathering statistics about the execution time of tasks is
 straightforward using `aload`. Given the log file(s) of a job, it will
@@ -12,9 +13,11 @@ or to report on.
 The second statistics may be helpful to estimate load imbalance, and
 improve resource requests for future jobs.
 
 Using `aload` is simple:
+
 ```bash
-$ aload --log bootstrap.pbs.log10493
+$ aload --log jobscript.slurm.log10493
 ```
+
 It is not always useful to include failed items in the statistics since
 their execution time may seriously skew the results. They can be
 excluded by adding the `--no_failed` flag to the call to `aload`.
@@ -23,9 +26,11 @@ Sometimes it can be useful to compute more detailed statistics or plot
 distributions of, e.g., the task execution time. It is beyond the scope
 of `aload` to do this, but the data can be exported for further analysis
 by adding the `--list_tasks` flag, i.e.,
+
 ```bash
-$ aload --log bootstrap.pbs.log10493 --list_tasks
+$ aload --log jobscript.slurm.log10493 --list_tasks
 ```
+
 Similarly, for raw data on the slaves, add the `--list_slaves` flag.
 If the output is to be imported in a software package, or parsed by a
 script, it can be more convenient to obtain it in CSV format by adding the
diff --git a/docs/alog.md b/docs/alog.md
index f6e7d13..2f20a8d 100644
--- a/docs/alog.md
+++ b/docs/alog.md
@@ -1,4 +1,5 @@
 # Logging for fun and profit
+
 Often, it is useful to log information about the execution of individual
 tasks. This information can be used
@@ -16,22 +17,25 @@ centralized logging in a single file. This requires a POSIX compliant
 shared file system when the job is running on multiple compute nodes.
 
 Again, consider the fragment of the job script:
+
 ```bash
-#!/bin/bash
+#!/usr/bin/env -S bash -l
 ...
 alpha=0.5
 beta=-1.3
 Rscript bootstrap.R $alpha $beta
 ...
 ```
+
 (Note that `aenv` was not used here, which was done to stress the point
 that `alog` and `aenv` are independent of one another, but obviously can
 be combined.)
 
 To enable logging, a call to `alog` is added as the first and the last
 executable line of the job script, i.e.,
+
 ```bash
-#!/bin/bash
+#!/usr/bin/env -S bash -l
 ...
 alog --state start
 alpha=0.5
@@ -39,6 +43,7 @@
 beta=-1.3
 Rscript bootstrap.R $alpha $beta
 alog --state end --exit $?
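+# note: $? holds the exit status of the most recently executed command,
+# so alog must run directly after Rscript, as above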
 ```
+
 Here we assume that the exit status of the last actual job command
 (`Rscript` in this example) is also the exit status of the task. The
 Linux convention is that exit code 0 signifies success, any value between
@@ -52,14 +57,15 @@ The resulting log file is automatically created, and its name will be
 the conventions of the queue system or scheduler used.
 
 The log file will look like, e.g.,
-``` 
+```
 1 started by r1i1n3 at 2016-09-02 11:47:45
 2 started by r1i1n3 at 2016-09-02 11:47:45
 3 started by r1i1n3 at 2016-09-02 11:47:46
 2 failed by r1i1n3 at 2016-09-02 11:47:46: 1
 3 completed by r1i1n3 at 2016-09-02 11:47:47
 ```
+
 The format is `<task id> <event> by <slave id> at <time stamp>`, followed
 by `: <exit status>` for failed jobs. For this particular example, task 1
 didn't complete, 2 failed, and 3 completed successfully.
diff --git a/docs/arange.md b/docs/arange.md
index 1826a89..6812921 100644
--- a/docs/arange.md
+++ b/docs/arange.md
@@ -1,4 +1,5 @@
 # Monitoring jobs and resuming tasks
+
 Keeping track of the tasks already completed, successfully or not, or
 tasks still pending can be somewhat annoying. Resuming tasks that were not
 completed, or that failed requires a level of bookkeeping you may prefer
@@ -8,14 +9,16 @@ Note that for this to work, your job should do logging using [`alog`](alog.md).
 
 ## Monitoring a running job
+
 Given either the CSV file or the task identifier range for a job, and its
 log file as generated by `alog`, `arange` will provide statistics on the
 progress of a running job, or a summary on a completed job.
 
-If the log file's name is `bootstrap.pbs.log10493`, and the job was based
+If the log file's name is `jobscript.slurm.log10493`, and the job was based
 on an CSV data file `data.csv`, a summary can be obtained by
+
 ```bash
-$ arange --data data.csv --log bootstrap.pbs.log10493 --summary
+$ arange --data data.csv --log jobscript.slurm.log10493 --summary
 ```
 In case a job has been resumed, you should list all log files relevant
 to the job to get correct results.
@@ -32,7 +35,7 @@ It can be switched off using the `--no_sniffer` option.
 Of course, `arange` works independently of `aenv`, so it also supports
 keeping track of general job arrays using the `-t` flag.
 ```bash
-$ arange -t 1-250 --log bootstrap.pbs.log10493 --summary
+$ arange -t 1-250 --log jobscript.slurm.log10493 --summary
 ```
 
 Sometimes it is useful to explicitly list the task identifiers of either
@@ -45,11 +48,11 @@ identifiers should be redone when an array job did not complete, or when
 some of its tasks failed. To get an identifier range of tasks that were
 not completed, use
 ```bash
-$ arange --data data.csv --log bootstrap.pbs.log10493`
+$ arange --data data.csv --log jobscript.slurm.log10493
 ```
 or, when not using `aenv`
 ```bash
-$ arange -t 1-250 --log bootstrap.pbs.log10493`
+$ arange -t 1-250 --log jobscript.slurm.log10493
 ```
 
 If you want to include the tasks that failed, for instance when a bug that
diff --git a/docs/areduce.md b/docs/areduce.md
index bef7a11..e9f8dfb 100644
--- a/docs/areduce.md
+++ b/docs/areduce.md
@@ -11,12 +11,20 @@
 file, not replicated throughout the aggregated file. More complicated
 aggregations, e.g., into an R dataframe required some programming.
 
 Suppose that the output of each task is stored in a file with name
-`out-{PBS_ARRAYID}.txt` where `PBS_ARRAYID` represents the array ID of
-the respective task, and the final output should be a file `out.txt` that
-is the concatenation of all the individual files.
+`out-{PBS_ARRAYID}.txt` or `out-{SLURM_ARRAY_TASK_ID}.txt` where `PBS_ARRAYID`
+or `SLURM_ARRAY_TASK_ID` represents the array ID of the respective task for PBS
+Torque or Slurm respectively, and the final output should be a file `out.txt`
+that is the concatenation of all the individual files.
+
 ```bash
 $ areduce -t 1-250 --pattern 'out-{PBS_ARRAYID}.txt' --out out.txt
 ```
+
+Similarly, for Slurm:
+```bash
+$ areduce -t 1-250 --pattern 'out-{SLURM_ARRAY_TASK_ID}.txt' --out out.txt
+```
+
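+For comparison, the corresponding plain shell command would be a simple
+concatenation (note that the glob expands in lexicographic order, so
+`out-10.txt` would precede `out-2.txt`):
+
+```bash
+$ cat out-*.txt > out.txt
+```
+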
 Although this could be easily achieved with `cat`, there are nevertheless
 advantages to using `areduce` even in this very simple case. `areduce`
 handles missing files (failed tasks) gracefully, whereas the corresponding
@@ -26,12 +34,14 @@ proper order of the files, while this would be cumbersome to do by hand.
 
 If each of the output files were CSV files, the first line of each file
 would contain the field names, that in the aggregated file should occur
 only once as the first line.
+
 ```bash
 $ areduce -t 1-250 --pattern 'out-{t}.csv' --out out.csv --mode csv
 ```
+
 The command above will produce the desired CSV file without any hassle.
-Note that the shorthand `t` for `PBS_ARRAYID` has been used in the file
-name pattern specification.
+Note that the shorthand `t` for `PBS_ARRAYID` or `SLURM_ARRAY_TASK_ID`
+has been used in the file name pattern specification.
 
 When one or more tasks failed, you may not want to aggregate the output
 of those tasks since it may be incomplete and/or incorrect. In that case,
@@ -51,6 +61,7 @@ specified via the `--mode` option.
 For example, the following command would aggregate data skipping three
 lines at the top, and five lines at the bottom of each individual output
 file:
+
 ```bash
 $ areduce -t 1-250 --pattern 'out-{t}.txt' --out out.txt \
     --mode body --reduce_args '--h 3 -f 5'
@@ -67,6 +78,7 @@ Examples can be found in the `reduce` directory.
 
 Arguments can be passed to the `empty` and `reduce` script as in the
 example below:
+
 ```bash
 $ areduce -t 1-250 --pattern 'out-{t}.csv' --out out.csv \
     --empty my_empty --reduce my_reduce --reduce_args '--h 3'
diff --git a/docs/job_arrays.md b/docs/job_arrays.md
index 4cd48a0..985de98 100644
--- a/docs/job_arrays.md
+++ b/docs/job_arrays.md
@@ -1,17 +1,28 @@
 # What are job arrays?
+
 A resource manager or scheduler that support job arrays typically exposes
 a task identifier to the job script as an environment variable. This is
 simply a number out of a range specified when the job is submitted.
 
 For the resource managers and schedulers supported by `atools`, that
 would be
+
 * `PBS_ARRAYID` for PBS torque,
-* `MOAB_JOBARRAYINDEX` for Adaptive's Moab, and
-* `SGE_TASKID` for SUN Grid Engine (SGE),
+* `MOAB_JOBARRAYINDEX` for Adaptive's Moab,
+* `SGE_TASKID` for SUN Grid Engine (SGE), and
 * `SLURM_ARRAY_TASK_ID` for Slurm workload manager.
 
 Typically, this task identifier is then use to determine, e.g., the
-specific input file for this task in the job script:
+specific input file for this task in the Slurm job script:
+
+```bash
+...
+INPUT_FILE="input-${SLURM_ARRAY_TASK_ID}.csv"
+...
+```
+
+Similarly, for a PBS Torque job script:
+
 ```bash
 ...
 INPUT_FILE="input-${PBS_ARRAYID}.csv"
@@ -20,14 +31,22 @@ INPUT_FILE="input-${PBS_ARRAYID}.csv"
 ...
 ```
 
 Submitting arrays jobs is quite simple.
 For each of the supported queue systems and schedulers, one simply adds
 the `-t <range>` options to
-the submission command, `qsub` for PBS torque and SUN grid engine, `msub`
-for Moab, e.g., for PBS torque:
+the submission command, `qsub` for PBS torque and SUN grid engine, `msub`
+for Moab, or `--array=<range>` to `sbatch` for Slurm, e.g., for Slurm:
+
 ```bash
-$ qsub -t 1-250 bootstrap.pbs
+$ sbatch --array=1-250 jobscript.slurm
 ```
+
+Similarly, for PBS torque:
+
+```bash
+$ qsub -t 1-250 jobscript.pbs
+```
+
 The submission command above would create a job array of 250 tasks, and
-for each the `PBS_ARRAYID` environment variable would be assigned a unique
-value between 1 and 250, inclusive.
+for each task the `SLURM_ARRAY_TASK_ID` or `PBS_ARRAYID` environment variable
+would be assigned a unique value between 1 and 250, inclusive.
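+
+As an aside, Slurm's `--array` option also accepts a concurrency limit
+after a `%` sign (standard `sbatch` syntax rather than an `atools`
+feature); the following would run the same 250 tasks, but never more
+than 25 at a time:
+
+```bash
+$ sbatch --array=1-250%25 jobscript.slurm
+```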
 
 Although job arrays provide sufficient features for simple scenarios, it
 quickly becomes a nuisance for more sophisticated problems, especially in
diff --git a/examples/cleanup.sh b/examples/cleanup.sh
index 57c3ac2..a1c487d 100755
--- a/examples/cleanup.sh
+++ b/examples/cleanup.sh
@@ -1,3 +1,3 @@
-#!/bin/bash
+#!/usr/bin/env -S bash -l
 
 rm -f *.pbs.* out-*.txt
diff --git a/examples/submit.sh b/examples/submit_pbs.sh
similarity index 82%
rename from examples/submit.sh
rename to examples/submit_pbs.sh
index a4628c3..4601ef1 100755
--- a/examples/submit.sh
+++ b/examples/submit_pbs.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env -S bash -l
 
 # set PATH to find arange executable
 PATH="../bin:$PATH"
diff --git a/examples/submit_slurm.sh b/examples/submit_slurm.sh
new file mode 100755
index 0000000..96caddd
--- /dev/null
+++ b/examples/submit_slurm.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env -S bash -l
+
+# set PATH to find arange executable
+PATH="../bin:$PATH"
+
+array_ids=$(arange --data data.csv)
+sbatch --array=${array_ids} test.slurm
diff --git a/examples/test.pbs b/examples/test.pbs
index 0d82b4f..0397c1c 100644
--- a/examples/test.pbs
+++ b/examples/test.pbs
@@ -1,4 +1,4 @@
-#!/bin/bash -l
+#!/usr/bin/env -S bash -l
 #PBS -l nodes=1:ppn=1
 #PBS -l walltime=00:05:00
 
diff --git a/examples/test.slurm b/examples/test.slurm
new file mode 100644
index 0000000..2e9e25e
--- /dev/null
+++ b/examples/test.slurm
@@ -0,0 +1,24 @@
+#!/usr/bin/env -S bash -l
+#SBATCH --nodes=1 --ntasks=1
+#SBATCH --time=00:05:00
+
+# not needed in real script, this is only to localize test
+PATH="../bin:$PATH"
+
+# log start of work item execution
+alog --state start
+
+# define work item parameters
+source <(aenv --data data.csv)
+
+# do actual work, i.e., original Slurm script
+echo "alpha = $alpha"
+echo "beta = $beta"
+echo "gamma = $gamma"
+exit_code=$(( $RANDOM % 2 ))
+
+result=$(echo "$alpha + $beta + $gamma" | bc -l)
+echo "$alpha,$beta,$gamma,$result" > "out-${SLURM_ARRAY_TASK_ID}.txt"
+
+# log end of work item execution
+alog --state end --exit "$exit_code"
diff --git a/mkdocs.yml b/mkdocs.yml
index 70e854d..b9141e7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,7 +1,7 @@
 site_name: atools (Job array tools) documentation
 site_description: atools is a set of utilities for using job arrays in a very convenient way.
 site_author: Geert Jan Bex
-pages:
+nav:
 - Introduction and motivation: 'index.md'
 - What are job arrays?: 'job_arrays.md'
 - Getting your parameters: 'aenv.md'