
Commit

Merge pull request #36 from gjbex/development
Slurmify documentation
gjbex authored May 24, 2024
2 parents 788618d + 16a1bac commit 429490f
Showing 15 changed files with 128 additions and 36 deletions.
7 changes: 7 additions & 0 deletions README.md
@@ -29,11 +29,13 @@ common tasks with minimal modification of job scripts.


## Documentation

Full documentation is available on
[Read the Docs](http://atools.readthedocs.io/en/latest/).


## Important note

If you use job arrays on an HPC system that accounts for compute time,
remember that each job in the array is accounted as an individual job.
Depending on the number of cores used by a job, this may increase the
@@ -42,6 +44,7 @@ cost by a large factor compared to using the


## Requirements

`atools` requires at least Python 3.2, but only uses the standard
library.
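
As a quick sanity check (purely illustrative), you can verify the Python
version available on the cluster with:

```bash
$ python3 --version   # any Python 3.2 or later will do
```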

@@ -51,17 +54,21 @@ that can be changed during installation.


## Installing

After downloading and unpacking, simply run `configure` and `make`. For
details, see the [documentation](http://atools.readthedocs.io/en/latest/).
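
For reference, a typical sequence might look like the sketch below (the version
number and install prefix are assumptions; see the documentation for the
authoritative steps):

```bash
$ tar xzf atools-1.5.2.tar.gz                    # assuming a release tarball
$ cd atools-1.5.2
$ ./configure --prefix="$HOME/software/atools"   # prefix is just an example
$ make
$ make install
```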


## Planned features

In no particular order...

* Template-based job script creation.
* Indexed data files for scaling to very large numbers of tasks.


## Contributors

* [Geert Jan Bex](mailto:[email protected]), Hasselt University/University
of Leuven
* Stefan Becuwe, Antwerp University
2 changes: 1 addition & 1 deletion configure.ac
@@ -1,4 +1,4 @@
AC_INIT([atools], [1.5.1], [[email protected]])
AC_INIT([atools], [1.5.2], [[email protected]])
AM_INIT_AUTOMAKE([-Wall -Werror foreign tar-pax])

AC_CONFIG_FILES([
14 changes: 8 additions & 6 deletions docs/acreate.md
@@ -1,28 +1,30 @@
# Adding atools features using templates

Although the modifications required to use `atools` are fairly simple,
they involve some steps that my be unfamiliar to the casual user.
they involve some steps that may be unfamiliar to the casual user.

`acreate` adds everything required to use `atools` effectively to an
existing job script. By default, it will insert the `PATH` redefinition
to use the `atools` commands, and the logging of start and end events.
Suppose the original log file is called `bootstrap.pbs`, then the command
Suppose the original job script is called `jobscript.slurm`, then the command
to generate the file annotated for `atools` is:

```bash
$ acreate bootstrap.pbs > bootstrap_atools.pbs
$ acreate jobscript.slurm > jobscript_atools.slurm
```

If `aenv` is to be used, in addition to logging, you simply add the
`--data` option:
```bash
$ acreate --data data.csv -- bootstrap.pbs > bootstrap_atools.pbs
$ acreate --data data.csv -- jobscript.slurm > jobscript_atools.slurm
```

The default shell is the one specified in the configuration file, but
this can be overridden on the command line using the `--shell` option,
e.g., if `bootstrap.pbs` where a tcsh shell script, you would use
e.g., if `jobscript.slurm` were a tcsh shell script, you would use

```bash
$ acreate --shell tcsh bootstrap.pbs > bootstrap_atools.pbs
$ acreate --shell tcsh jobscript.slurm > jobscript_atools.slurm
```

It is also possible to supply your own template instead of the one provided
11 changes: 9 additions & 2 deletions docs/aenv.md
@@ -1,4 +1,5 @@
# Getting your parameters: `aenv`

The parameters for tasks can be stored in a CSV file, where the first
row is simply the names of the parameters, and each consecutive row
represents the values of these parameters for a specific experiment, i.e.,
@@ -7,19 +8,22 @@ computational task.
`aenv` will use the task identifier as an index into this CSV file, and
define environment variables with the appropriate values for that task.
As an example, consider the following job script:

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
alpha=0.5
beta=-1.3
Rscript bootstrap.R $alpha $beta
...
```

However, this approach would lead to as many job scripts as there are
parameter instances, which is inconvenient to say the least.

This computation would have to be done for many values of `alpha` and
`beta`. These values can be represented in a CSV file, `data.csv`:

```
alpha,beta
0.5,-1.3
@@ -28,15 +32,18 @@ alpha,beta
0.6,-1.3
...
```

The job script can be modified to automatically define the appropriate
values for `alpha` and `beta` specific to the task.

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
source <(aenv --data data.csv)
Rscript bootstrap.R $alpha $beta
...
```

`aenv` will use the value of the task identifier to read the corresponding
row in the `data.csv` CSV file, and export the variables `alpha` and `beta`
with those values.
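
For the first task, for instance, sourcing the `aenv` output has the same net
effect as the following assignments (shown purely as an illustration; the exact
text emitted by `aenv` may differ):

```bash
# net effect for task identifier 1, given the data.csv above (illustrative only)
export alpha='0.5'
export beta='-1.3'
```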
9 changes: 7 additions & 2 deletions docs/aload.md
@@ -1,4 +1,5 @@
# Detailed job statistics

Gathering statistics about the execution time of tasks is straightforward
using `aload`. Given the log file(s) of a job, it will

@@ -12,9 +13,11 @@ or to report on. The second statistic may be helpful to estimate load
imbalance, and improve resource requests for future jobs.

Using `aload` is simple:

```bash
$ aload --log bootstrap.pbs.log10493
$ aload --log jobscript.slurm.log10493
```

It is not always useful to include failed items in the statistics since
their execution time may seriously skew the results. They can be excluded
by adding the `--no_failed` flag to the call to `aload`.
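
For example, reusing the log file name from above:

```bash
$ aload --log jobscript.slurm.log10493 --no_failed
```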
@@ -23,9 +26,11 @@ Sometimes it can be useful to compute more detailed statistics or plot
distributions of, e.g., the task execution time. It is beyond the scope
of `aload` to do this, but the data can be exported for further analysis
by adding the `--list_tasks` flag, i.e.,

```bash
$ aload --log bootstrap.pbs.log10493 --list_tasks
$ aload --log jobscript.slurm.log10493 --list_tasks
```

Similarly, for raw data on the slaves, add the `--list_slaves` flag.
If the output is to be imported into a software package, or parsed by a
script, it can be more convenient to obtain it in CSV format by adding the
12 changes: 9 additions & 3 deletions docs/alog.md
@@ -1,4 +1,5 @@
# Logging for fun and profit

Often, it is useful to log information about the execution of individual
tasks. This information can be used

@@ -16,29 +17,33 @@ centralized logging in a single file. This requires a POSIX compliant
shared file system when the job is running on multiple compute nodes.

Again, consider the fragment of the job script:

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
alpha=0.5
beta=-1.3
Rscript bootstrap.R $alpha $beta
...
```

(Note that `aenv` was not used here, which was done to stress the point
that `alog` and `aenv` are independent of one another, but obviously can
be combined.)

To enable logging, a call to `alog` is added as the first and the last
executable line of the job script, i.e.,

```bash
#!/bin/bash
#!/usr/bin/env -S bash -l
...
alog --state start
alpha=0.5
beta=-1.3
Rscript bootstrap.R $alpha $beta
alog --state end --exit $?
```

Here we assume that the exit status of the last actual job command
(`Rscript` in this example) is also the exit status of the task. The
Linux convention is that exit code 0 signifies success, any value between
@@ -52,14 +57,15 @@ The resulting log file is automatically created, and its name will be
the conventions of the queue system or scheduler used.

The log file will look like, e.g.,
```

```
1 started by r1i1n3 at 2016-09-02 11:47:45
2 started by r1i1n3 at 2016-09-02 11:47:45
3 started by r1i1n3 at 2016-09-02 11:47:46
2 failed by r1i1n3 at 2016-09-02 11:47:46: 1
3 completed by r1i1n3 at 2016-09-02 11:47:47
```

The format is `<task-id> <status> by <node-name> at <time-stamp>`, followed
by `: <exit-status>` for failed jobs. For this particular example, task
1 didn't complete, 2 failed, and 3 completed successfully.
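
For reference, a job script that combines `aenv` and `alog` might look like the
sketch below, assembled from the fragments above:

```bash
#!/usr/bin/env -S bash -l
...
alog --state start
source <(aenv --data data.csv)
Rscript bootstrap.R $alpha $beta
alog --state end --exit $?
```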
13 changes: 8 additions & 5 deletions docs/arange.md
@@ -1,4 +1,5 @@
# Monitoring jobs and resuming tasks

Keeping track of the tasks already completed, successfully or not, or tasks
still pending can be somewhat annoying. Resuming tasks that were not
completed, or that failed requires a level of bookkeeping you may prefer
@@ -8,14 +9,16 @@ Note that for this to work, your job should do logging using
[`alog`](alog.md).

## Monitoring a running job

Given either the CSV file or the task identifier range for a job, and its
log file as generated by `alog`, `arange` will provide statistics on the
progress of a running job, or a summary of a completed job.

If the log file's name is `bootstrap.pbs.log10493`, and the job was based
If the log file's name is `jobscript.slurm.log10493`, and the job was based
on a CSV data file `data.csv`, a summary can be obtained by

```bash
$ arange --data data.csv --log bootstrap.pbs.log10493 --summary
$ arange --data data.csv --log jobscript.slurm.log10493 --summary
```
In case a job has been resumed, you should list all log files relevant to
the job to get correct results.
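
For instance, if a resumed job produced a second log file, the call might look
like the sketch below (this assumes multiple log files can simply be listed
after `--log`; the second file name is hypothetical):

```bash
$ arange --data data.csv --log jobscript.slurm.log10493 jobscript.slurm.log10517 \
    --summary
```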
@@ -32,7 +35,7 @@ It can be switched off using the `--no_sniffer` option.
Of course, `arange` works independently of `aenv`, so it also supports
keeping track of general job arrays using the `-t` flag.
```bash
$ arange -t 1-250 --log bootstrap.pbs.log10493 --summary
$ arange -t 1-250 --log jobscript.slurm.log10493 --summary
```

Sometimes it is useful to explicitly list the task identifiers of either
@@ -45,11 +48,11 @@ identifiers should be redone when an array job did not complete, or when
some of its tasks failed. To get an identifier range of tasks that were
not completed, use
```bash
$ arange --data data.csv --log bootstrap.pbs.log10493`
$ arange --data data.csv --log jobscript.slurm.log10493
```
or, when not using `aenv`
```bash
$ arange -t 1-250 --log bootstrap.pbs.log10493`
$ arange -t 1-250 --log jobscript.slurm.log10493
```

If you want to include the tasks that failed, for instance when a bug that
22 changes: 17 additions & 5 deletions docs/areduce.md
@@ -11,12 +11,20 @@ file, not replicated throughout the aggregated file. More complicated
aggregations, e.g., into an R dataframe, require some programming.

Suppose that the output of each task is stored in a file with name
`out-{PBS_ARRAYID}.txt` where `PBS_ARRAYID` represents the array ID of
the respective task, and the final output should be a file `out.txt` that
is the concatenation of all the individual files.
`out-{PBS_ARRAYID}.txt` or `out-{SLURM_ARRAY_TASK_ID}.txt` where `PBS_ARRAYID`
or `SLURM_ARRAY_TASK_ID` represents the array ID of the respective task for PBS
Torque or Slurm respectively, and the final output should be a file `out.txt`
that is the concatenation of all the individual files.

```bash
$ areduce -t 1-250 --pattern 'out-{PBS_ARRAYID}.txt' --out out.txt
```

Similarly, for Slurm:
```bash
$ areduce -t 1-250 --pattern 'out-{SLURM_ARRAY_TASK_ID}.txt' --out out.txt
```

Although this could be easily achieved with `cat`, there are nevertheless
advantages to using `areduce` even in this very simple case. `areduce`
handles missing files (failed tasks) gracefully, whereas the corresponding
@@ -26,12 +34,14 @@ proper order of the files, while this would be cumbersome to do by hand.
If each of the output files were a CSV file, the first line of each file
would contain the field names, which should occur only once in the
aggregated file, as its first line.

```bash
$ areduce -t 1-250 --pattern 'out-{t}.csv' --out out.csv --mode csv
```

The command above will produce the desired CSV file without any hassle.
Note that the shorthand `t` for `PBS_ARRAYID` has been used in the file
name pattern specification.
Note that the shorthand `t` for `PBS_ARRAYID` or `SLURM_ARRAY_TASK_ID`
has been used in the file name pattern specification.

When one or more tasks failed, you may not want to aggregate the output of
those tasks since it may be incomplete and/or incorrect. In that case,
@@ -51,6 +61,7 @@ specified via the `--mode` option.

For example, the following command would aggregate data skipping three lines
at the top, and five lines at the bottom of each individual output file:

```bash
$ areduce -t 1-250 --pattern 'out-{t}.txt' --out out.txt \
--mode body --reduce_args '--h 3 -f 5'
@@ -67,6 +78,7 @@ Examples can be found in the `reduce` directory.

Arguments can be passed to the `empty` and `reduce` scripts as in the example
below:

```bash
$ areduce -t 1-250 --pattern 'out-{t}.csv' --out out.csv \
--empty my_empty --reduce my_reduce --reduce_args '--h 3'
35 changes: 27 additions & 8 deletions docs/job_arrays.md
@@ -1,17 +1,28 @@
# What are job arrays?

A resource manager or scheduler that supports job arrays typically
exposes a task identifier to the job script as an environment variable.
This is simply a number out of a range specified when the job is submitted.

For the resource managers and schedulers supported by `atools`, that would
be

* `PBS_ARRAYID` for PBS torque,
* `MOAB_JOBARRAYINDEX` for Adaptive's Moab, and
* `SGE_TASKID` for SUN Grid Engine (SGE),
* `MOAB_JOBARRAYINDEX` for Adaptive's Moab,
* `SGE_TASKID` for SUN Grid Engine (SGE), and
* `SLURM_ARRAY_TASK_ID` for Slurm workload manager.

Typically, this task identifier is then used to determine, e.g., the
specific input file for this task in the job script:
specific input file for this task in the Slurm job script:

```bash
...
INPUT_FILE="input-${SLURM_ARRAY_TASK_ID}.csv"
...
```

Similarly, for a PBS Torque job script:

```bash
...
INPUT_FILE="input-${PBS_ARRAYID}.csv"
@@ -20,14 +31,22 @@ INPUT_FILE="input-${PBS_ARRAYID}.csv"

Submitting array jobs is quite simple. For each of the supported queue
systems and schedulers, one simply adds the `-t <int-range>` option to
the submission command, `qsub` for PBS torque and SUN grid engine, `msub`
for Moab, e.g., for PBS torque:
the submission command (`qsub` for PBS torque and SUN grid engine, `msub`
for Moab), or `--array=<int-range>` to `sbatch` for Slurm, e.g., for Slurm:

```bash
$ qsub -t 1-250 bootstrap.pbs
$ sbatch --array=1-250 jobscript.slurm
```

Similarly, for PBS torque:

```bash
$ qsub -t 1-250 jobscript.pbs
```

The submission command above would create a job array of 250 tasks, and
for each the `PBS_ARRAYID` environment variable would be assigned a unique
value between 1 and 250, inclusive.
for each task, the `SLURM_ARRAY_TASK_ID` or `PBS_ARRAYID` environment variable
would be assigned a unique value between 1 and 250, inclusive.

Although job arrays provide sufficient features for simple scenarios, they
quickly become a nuisance for more sophisticated problems, especially in
2 changes: 1 addition & 1 deletion examples/cleanup.sh
@@ -1,3 +1,3 @@
#!/bin/bash
#!/usr/bin/env -S bash -l

rm -f *.pbs.* out-*.txt