DOC: update batch job sections in manual
Nothing big, just clarify things here and there.
elcorto committed May 6, 2024
1 parent f943772 commit 7325578
Showing 3 changed files with 36 additions and 15 deletions.
39 changes: 29 additions & 10 deletions doc/source/written/manual.md
@@ -984,7 +984,8 @@ lets you spin up a dask cluster on distributed infrastructure.

With these tools, you have fine-grained control over how many batch jobs you
start. This is useful for cases where you have many small workloads (each
-`func(pset)` runs only a few seconds, say). Then one batch job per `pset` would
+`func(pset)` runs only a few seconds, say). Then one batch job per `pset`, as
+in the [templates workflow](s:templates), would
be overhead and using many dask workers living in only a few (or one!) batch
job is more efficient.

@@ -998,13 +999,15 @@ which covers the most important `SLURMCluster` settings.

#### API

-The `psweep` API to use `dask` is just `df=ps.run(..., dask_client=client)`.
+The `psweep` API to use `dask` is `df=ps.run(..., dask_client=client)`.
First, let's look at an example of using `dask` locally.

```py
from dask.distributed import Client, LocalCluster

# The type of cluster depends on your compute infrastructure. Replace
-# LocalCluster with e.g. dask_jobqueue.SLURMCluster.
+# LocalCluster with e.g. dask_jobqueue.SLURMCluster when running on an HPC
+# machine.
cluster = LocalCluster()
client = Client(cluster)

@@ -1037,7 +1040,7 @@ client = Client()
is sufficient for running locally.


-#### Example
+#### HPC machine example

The example below uses the SLURM workload manager typically found in HPC
centers.
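
The full SLURM example from the manual is collapsed in this diff, so here is a
rough sketch of what such a setup can look like. The queue name, walltime and
the toy `func` / `params` are made up; the worker settings mirror
`examples/batch_dask/run_psweep.py` further down in this diff.

```py
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

import psweep as ps


def func(pset):
    # Toy workload standing in for the real computation.
    return {"result": pset["param_a"] ** 2}


params = ps.plist("param_a", [1.0, 2.0, 3.0])

# Each batch job hosts 10 dask workers (processes=10) which share 10 cores and
# 10 GiB of memory, i.e. 1 core and 1 GiB per worker. Queue name and walltime
# are placeholders.
cluster = SLURMCluster(
    queue="some_partition",
    processes=10,
    cores=10,
    memory="10GiB",
    walltime="01:00:00",
)

# Submit 2 batch jobs -> 2 x 10 = 20 workers in total.
cluster.scale(jobs=2)

client = Client(cluster)
df = ps.run(func, params, dask_client=client)
```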
@@ -1097,7 +1100,7 @@ mybox$ ssh -L 2222:localhost:3333 hpc.machine.edu
mybox$ browser localhost:2222
```

-If `dask_control` runs on a compute node, you may need a second tunnel:
+If `dask_control` runs on a compute node, you will need a second tunnel:

```sh
mybox$ ssh -L 2222:localhost:3333 hpc.machine.edu
@@ -1150,6 +1153,7 @@ the dask dashboard and more.
`dask.distributed` and `dask_jobqueue`.
```

+(s:templates)=
### Templates

This template-based workflow is basically a modernized
@@ -1168,10 +1172,17 @@ workloads. See the [Pros and Cons](s:template-pro-con) section below.
The central function to use is `ps.prep_batch()`. See `examples/batch_templates`
for a full example.
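
Sketched briefly (the authoritative version is in `examples/batch_templates`;
the assumption here is that `prep_batch()` takes the same `params` list that
`run()` does), the driver script boils down to building the psets and handing
them to `ps.prep_batch()`:

```py
import psweep as ps

# Hypothetical parameters to vary. Placeholders such as $param_a and $param_b
# in the template files get replaced by these values, one calc/<_pset_id>/
# directory per pset.
params = ps.pgrid([
    ps.plist("param_a", [1.0, 2.0]),
    ps.plist("param_b", ["x", "y"]),
])

ps.prep_batch(params)
```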

-The workflow is based on **template files**. In the templates, we use
-the standard library's `string.Template`, where each `$foo` is replaced by a
-value contained in a pset, so `$param_a`, `$param_b`, as well as `$_pset_id`
-and so forth.
+The workflow is based on **template files**. In the templates, we use the
+standard library's
+[`string.Template`](https://docs.python.org/3/library/string.html#template-strings),
+where each `$foo` is replaced by a value contained in a pset, so `$param_a`,
+`$param_b`, as well as `$_pset_id` and so forth.

+```{note}
+If your template files are shell scripts that contain variables like `$foo`,
+you need to escape the `$` with `$$foo`, else they will be treated as
+placeholders.
+```
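
Both the substitution and the `$$` escape can be checked directly with the
standard library; a small self-contained demo with made-up values:

```py
from string import Template

# $param_a and $_pset_id are placeholders, $$HOME survives as a literal $HOME.
templ = Template("param_a=$param_a id=$_pset_id home=$$HOME")
print(templ.substitute(param_a=1.23, _pset_id="11967c0d"))
# -> param_a=1.23 id=11967c0d home=$HOME
```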

We piggy-back on the `run()` workflow from above to, instead of running jobs
with it, just **create batch scripts using template files**.
@@ -1250,6 +1261,14 @@ templates
└── jobscript
```

+The template workflow is very generic. One aspect of this design is that each
+template file is treated as a simple text file, be it a Python script, a shell
+script, a config file or anything else. Above we use a small Python script
+`run.py` for demonstration purposes and communicate `pset` content (parameters
+to vary) by replacing placeholders in there. See [this
+section](s:template-pro-con) for other ways to improve this in the Python
+script case.
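
To illustrate, a made-up `templates/calc/run.py` could look like the snippet
below. As a template it only becomes valid Python once `$param_a` and
`$_pset_id` have been substituted; the file name `result.pk` and the workload
are invented for this illustration.

```py
# templates/calc/run.py (template, placeholders not yet substituted)
import pickle

param_a = $param_a
pset_id = "$_pset_id"

# Stand-in for the real workload.
result = param_a ** 2

with open("result.pk", "wb") as fd:
    pickle.dump({"param_a": param_a, "_pset_id": pset_id, "result": result}, fd)
```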

##### calc templates

Each file in `templates/calc` such as `run.py` will be treated as
@@ -1332,7 +1351,7 @@ cd 11967c0d-7ce6-404f-aae6-2b0ea74beefa; sbatch jobscript_cluster; cd $here # r
#### git support

Use `prep_batch(..., git=True)` to have some basic git support such as
-automatic commits in each call. It just uses `run(..., git=True)` when
+automatic commits in each call. It uses `run(..., git=True)` when
creating batch scripts, so all best practices for that apply here as well. In
particular, make sure to create `.gitignore` first, else we'll track `calc/` as
well, which you may safely do when data in `calc` is small. Else use `git-lfs`,
6 changes: 3 additions & 3 deletions examples/batch_dask/dask_control.slurm
@@ -9,9 +9,9 @@
#SBATCH --time=2-0

# Submit this job manually via "sbatch <this_script>". It will run on some node
-# and from there spin up the dask cluster defined in run_psweep.py. The dask
-# cluster will then run workloads. After all work is done, the dask cluster
-# will be teared down and this job will exit.
+# and from there submit the batch jobs to spin up the dask cluster defined in
+# run_psweep.py. The dask cluster will then run workloads. After all work is
+# done, the dask cluster will be torn down and this job will exit.
#
# The reason for using a batch job to run the dask control process (i.e. python
# run_psweep.py) is that typically on HPC machines, there is a time limit for
6 changes: 4 additions & 2 deletions examples/batch_dask/run_psweep.py
Expand Up @@ -40,8 +40,10 @@ def func(pset):
##params = ps.pgrid([a])
params = a

-# Start 2 batch jobs, each with 10 dask workers and 10 cores, so 1 thread /
-# worker and 20 workers in total. Each worker gets 1 GiB of memory.
+# Start 2 batch jobs, each with 10 dask workers (processes=10) and 10
+# cores, so 1 core (1 thread) / worker and 20 workers in total (2 jobs x 10
+# workers). Each worker gets 1 GiB of memory (memory="10GiB" for 10
+# workers). See examples/batch_dask/slurm_cluster_settings.py.
cluster.scale(jobs=2)
client = Client(cluster)
df = ps.run(func, params, dask_client=client)
