DOC: update batch job sections in manual
Nothing big, just clarify things here and there.
elcorto committed May 6, 2024
1 parent f943772 commit 7325578
Showing 3 changed files with 36 additions and 15 deletions.
39 changes: 29 additions & 10 deletions doc/source/written/manual.md
@@ -984,7 +984,8 @@ lets you spin up a dask cluster on distributed infrastructure.

With these tools, you have fine-grained control over how many batch jobs you
start. This is useful for cases where you have many small workloads (each
-`func(pset)` runs only a few seconds, say). Then one batch job per `pset` would
+`func(pset)` runs only a few seconds, say). Then one batch job per `pset`, as
+in the [templates workflow](s:templates), would
be overhead and using many dask workers living in only a few (or one!) batch
job is more efficient.

@@ -998,13 +999,15 @@ which covers the most important `SLURMCluster` settings.

#### API

-The `psweep` API to use `dask` is just `df=ps.run(..., dask_client=client)`.
+The `psweep` API to use `dask` is `df=ps.run(..., dask_client=client)`.
First, let's look at an example of using `dask` locally.

```py
from dask.distributed import Client, LocalCluster

# The type of cluster depends on your compute infrastructure. Replace
-# LocalCluster with e.g. dask_jobqueue.SLURMCluster.
+# LocalCluster with e.g. dask_jobqueue.SLURMCluster when running on an HPC
+# machine.
cluster = LocalCluster()
client = Client(cluster)

@@ -1037,7 +1040,7 @@ client = Client()
is sufficient for running locally.


-#### Example
+#### HPC machine example

The example below uses the SLURM workload manager typically found in HPC
centers.
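
The full SLURM example from the manual is collapsed in this diff, so here is a
rough sketch of what such a setup can look like. The queue name, walltime and
the toy `func` / `params` are made up; the worker settings mirror
`examples/batch_dask/run_psweep.py` further down in this diff.

```py
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

import psweep as ps


def func(pset):
    # Toy workload standing in for the real computation.
    return {"result": pset["param_a"] ** 2}


params = ps.plist("param_a", [1.0, 2.0, 3.0])

# Each batch job hosts 10 dask workers (processes=10) which share 10 cores and
# 10 GiB of memory, i.e. 1 core and 1 GiB per worker. Queue name and walltime
# are placeholders.
cluster = SLURMCluster(
    queue="some_partition",
    processes=10,
    cores=10,
    memory="10GiB",
    walltime="01:00:00",
)

# Submit 2 batch jobs -> 2 x 10 = 20 workers in total.
cluster.scale(jobs=2)

client = Client(cluster)
df = ps.run(func, params, dask_client=client)
```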
@@ -1097,7 +1100,7 @@ mybox$ ssh -L 2222:localhost:3333 hpc.machine.edu
mybox$ browser localhost:2222
```

-If `dask_control` runs on a compute node, you may need a second tunnel:
+If `dask_control` runs on a compute node, you will need a second tunnel:

```sh
mybox$ ssh -L 2222:localhost:3333 hpc.machine.edu
@@ -1150,6 +1153,7 @@ the dask dashboard and more.
`dask.distributed` and `dask_jobqueue`.
```

+(s:templates)=
### Templates

This template-based workflow is basically a modernized
@@ -1168,10 +1172,17 @@ workloads. See the [Pros and Cons](s:template-pro-con) section below.
The central function to use is `ps.prep_batch()`. See `examples/batch_templates`
for a full example.
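
Sketched briefly (the authoritative version is in `examples/batch_templates`;
the assumption here is that `prep_batch()` takes the same `params` list that
`run()` does), the driver script boils down to building the psets and handing
them to `ps.prep_batch()`:

```py
import psweep as ps

# Hypothetical parameters to vary. Placeholders such as $param_a and $param_b
# in the template files get replaced by these values, one calc/<_pset_id>/
# directory per pset.
params = ps.pgrid([
    ps.plist("param_a", [1.0, 2.0]),
    ps.plist("param_b", ["x", "y"]),
])

ps.prep_batch(params)
```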

-The workflow is based on **template files**. In the templates, we use
-the standard library's `string.Template`, where each `$foo` is replaced by a
-value contained in a pset, so `$param_a`, `$param_b`, as well as `$_pset_id`
-and so forth.
+The workflow is based on **template files**. In the templates, we use the
+standard library's
+[`string.Template`](https://docs.python.org/3/library/string.html#template-strings),
+where each `$foo` is replaced by a value contained in a pset, so `$param_a`,
+`$param_b`, as well as `$_pset_id` and so forth.

+```{note}
+If your template files are shell scripts that contain variables like `$foo`,
+you need to escape the `$` with `$$foo`, else they will be treated as
+placeholders.
+```
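
Both the substitution and the `$$` escape can be checked directly with the
standard library; a small self-contained demo with made-up values:

```py
from string import Template

# $param_a and $_pset_id are placeholders, $$HOME survives as a literal $HOME.
templ = Template("param_a=$param_a id=$_pset_id home=$$HOME")
print(templ.substitute(param_a=1.23, _pset_id="11967c0d"))
# -> param_a=1.23 id=11967c0d home=$HOME
```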

We piggy-back on the `run()` workflow from above to, instead of running jobs
with it, just **create batch scripts using template files**.
@@ -1250,6 +1261,14 @@ templates
└── jobscript
```

+The template workflow is very generic. One aspect of this design is that each
+template file is treated as a simple text file, be it a Python script, a shell
+script, a config file or anything else. Above we use a small Python script
+`run.py` for demonstration purposes and communicate `pset` content (parameters
+to vary) by replacing placeholders in there. See [this
+section](s:template-pro-con) for other ways to improve this in the Python
+script case.
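
To illustrate, a made-up `templates/calc/run.py` could look like the snippet
below. As a template it only becomes valid Python once `$param_a` and
`$_pset_id` have been substituted; the file name `result.pk` and the workload
are invented for this illustration.

```py
# templates/calc/run.py (template, placeholders not yet substituted)
import pickle

param_a = $param_a
pset_id = "$_pset_id"

# Stand-in for the real workload.
result = param_a ** 2

with open("result.pk", "wb") as fd:
    pickle.dump({"param_a": param_a, "_pset_id": pset_id, "result": result}, fd)
```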

##### calc templates

Each file in `templates/calc` such as `run.py` will be treated as
@@ -1332,7 +1351,7 @@ cd 11967c0d-7ce6-404f-aae6-2b0ea74beefa; sbatch jobscript_cluster; cd $here # r
#### git support

Use `prep_batch(..., git=True)` to have some basic git support such as
-automatic commits in each call. It just uses `run(..., git=True)` when
+automatic commits in each call. It uses `run(..., git=True)` when
creating batch scripts, so all best practices for that apply here as well. In
particular, make sure to create `.gitignore` first, else we'll track `calc/` as
well, which you may safely do when data in `calc` is small. Else use `git-lfs`,
6 changes: 3 additions & 3 deletions examples/batch_dask/dask_control.slurm
@@ -9,9 +9,9 @@
#SBATCH --time=2-0

# Submit this job manually via "sbatch <this_script>". It will run on some node
-# and from there spin up the dask cluster defined in run_psweep.py. The dask
-# cluster will then run workloads. After all work is done, the dask cluster
-# will be teared down and this job will exit.
+# and from there submit the batch jobs to spin up the dask cluster defined in
+# run_psweep.py. The dask cluster will then run workloads. After all work is
+# done, the dask cluster will be torn down and this job will exit.
#
# The reason for using a batch job to run the dask control process (i.e. python
# run_psweep.py) is that typically on HPC machines, there is a time limit for
6 changes: 4 additions & 2 deletions examples/batch_dask/run_psweep.py
Expand Up @@ -40,8 +40,10 @@ def func(pset):
##params = ps.pgrid([a])
params = a

-# Start 2 batch jobs, each with 10 dask workers and 10 cores, so 1 thread /
-# worker and 20 workers in total. Each worker gets 1 GiB of memory.
+# Start 2 batch jobs, each with 10 dask workers (processes=10) and 10
+# cores, so 1 core (1 thread) / worker and 20 workers in total (2 jobs x 10
+# workers). Each worker gets 1 GiB of memory (memory="10GiB" for 10
+# workers). See examples/batch_dask/slurm_cluster_settings.py.
cluster.scale(jobs=2)
client = Client(cluster)
df = ps.run(func, params, dask_client=client)
