Commit 10d97cd

more detailed examples for Slurm and HTCondor
1 parent 66ee6d6 commit 10d97cd

File tree: 1 file changed (+89 −11 lines)

src/using/batch.md
@@ -37,8 +37,8 @@

Additionally, most clusters share your `${HOME}` directory with the worker nodes and so you don't even need to bother copying `denv` to where the jobs are being run.

## Preparing for Batch Running
The above instructions have you set up to run `denv` on the cluster just like you run `denv` on your own computer;
however, doing a few more steps is helpful to ensure that the batch jobs run reliably and efficiently.

### Pre-Building SIF Images
Under-the-hood, `apptainer` runs images from SIF files.
@@ -54,11 +54,21 @@

```shell
cd path/to/big/dir
apptainer build ldmx_pro_v4.2.3.sif docker://ldmx/pro:v4.2.3 # just an example, name the SIF file appropriately
```

## Running the SIF Image
How we run the image during the jobs depends on how the jobs are configured.
For the clusters I have access to (UMN and SLAC), there are two different ways for jobs to be configured
that mainly change _where_ the job is run.

~~~admonish success title="Check Where Jobs are Run"
A good way to check this (and learn about the batch job system that you want to use)
is to figure out how to run a job that just runs `pwd`.
This command prints out the "present working directory" and so you can see where
the job is being run from.

Refer to your cluster's IT staff, its documentation, and the batch job system's documentation to
learn how to do this; a minimal sketch is given below.
~~~
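
For example, on a Slurm cluster a one-off `pwd` job can be submitted with `sbatch --wrap`, and on an HTCondor cluster a tiny submit file does the same. These are minimal sketches: the file names (`where.log`, `pwd.sub`, and so on) are placeholders, and your cluster may require extra options (for example an account or partition).
```shell
# Slurm: run `pwd` as a one-off job and write its output to where.log
sbatch --output=where.log --wrap="pwd"
```
On HTCondor, the equivalent is a small submit file:
```
# pwd.sub : a minimal submit file that just runs `pwd`
executable = /bin/pwd
output = where.out
error = where.err
log = where.log
queue
```
```shell
condor_submit pwd.sub
```
Once the job finishes, the `where.*` files show the directory the job ran from.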

#### Jobs Run In Submitted Directory
At SLAC S3DF, the jobs submitted with `sbatch` are run from the directory where `sbatch` was run.
This makes it rather easy to run jobs.
@@ -67,34 +77,102 @@

We can create a denv and then submit a job running `denv` from within that directory.
```shell
cd batch/submit/dir
denv init /full/path/to/big/dir/ldmx_pro_v4.2.3.sif
```

For example, submitting jobs for a range of run numbers would look like
```shell
mkdir log # the #SBATCH options in submit.sh put the log files here
sbatch --array=0-10 submit.sh
```
with `submit.sh` being
```bash
#!/bin/bash
#SBATCH --job-name my-job
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2g
#SBATCH --time=04:00:00 # time limit for jobs
#SBATCH --output=log/%A-%a.log
#SBATCH --error=log/%A-%a.log

set -o errexit
set -o nounset

# assume the configuration script config.py takes one argument:
# the run number it should use for the simulation,
# from which it also uniquely constructs the path of the output file
denv fire config.py ${SLURM_ARRAY_TASK_ID}
# fire is run inside ldmx/pro:v4.2.3 IF SUBMITTED FROM batch/submit/dir
```
Look at the SLAC S3DF and Slurm documentation to learn more about configuring the batch jobs themselves.

~~~admonish note title="Comments"
- _Technically_, since SLAC S3DF's `${SCRATCH}` directory is also shared across the worker nodes, you do not need to pre-build the image. However, this is not advised: if the `${SCRATCH}` directory is periodically cleaned during your jobs, the cached SIF image would be lost and your jobs could fail in confusing ways.
- Some clusters configure Slurm to limit the number of jobs you can submit at once with `--array`. This means you might need to submit the jobs in "chunks" and add an offset to `SLURM_ARRAY_TASK_ID` so that the different "chunks" have different run numbers. This can be done with bash's math syntax, e.g. `$(( SLURM_ARRAY_TASK_ID + 100 ))`; see the sketch below.
~~~
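
To make the second comment concrete, here is one possible sketch; the chunk size of 100 and the use of a positional offset argument are illustrative assumptions, not part of the original `submit.sh`.
```shell
# submit two "chunks" of 100 jobs each, with different run-number offsets
sbatch --array=0-99 submit.sh 0
sbatch --array=0-99 submit.sh 100
```
with the end of `submit.sh` adjusted to
```bash
# offset passed as the first argument (defaults to 0 if not given)
run_number=$(( SLURM_ARRAY_TASK_ID + ${1:-0} ))
denv fire config.py ${run_number}
```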

#### Jobs Run in Scratch Directory
At UMN's CMS cluster, the jobs submitted with `condor_submit` are run from a newly-created scratch directory.
This makes it slightly difficult to inform `denv` of the configuration we want to use.
`denv` has an experimental shebang syntax that could be helpful for this purpose.

`prod.sh`
```bash
#!/full/path/to/denv shebang
#!denv_image=/full/path/to/ldmx_pro_v4.2.3.sif
#!bash

set -o nounset
set -o errexit

# everything here is run in `bash` inside ldmx/pro:v4.2.3
# assume the run number is provided as an argument
fire config.py ${1}
```

with the submit file `submit.sub` in the same directory:
```
# run prod.sh and transfer it to the scratch area
executable = prod.sh
transfer_executable = yes

# terminal and condor output log files
# helpful for debugging at a slight performance cost
# (make sure the logs/ directory exists before submitting)
output = logs/$(Cluster)-$(Process).out
error = $(output)
log = $(Cluster)-condor.log

# "hold" the job if there is a non-zero exit code
# and store the exit code in the hold reason subcode
on_exit_hold = ExitCode != 0
on_exit_hold_subcode = ExitCode
on_exit_hold_reason = "Program exited with non-zero exit code"

# the 'Process' variable is an index for the job in the submission cluster
arguments = "$(Process)"
```
And then you would `condor_submit` this with
```shell
condor_submit submit.sub -queue 10
```

~~~admonish note collapsible=true title="Alternative Script Design"
Alternatively, one could write a script _around_ `denv` like
```shell
#!/bin/bash

set -o nounset
set -o errexit

# stuff here is run outside ldmx/pro:v4.2.3
# need to call `denv` to go into the image
denv init /full/path/to/ldmx_pro_v4.2.3.sif
denv fire config.py ${1}
```
The `denv init` call writes a few small files which shouldn't have a large impact on performance
(but could if the directory in which the job is being run has a slow filesystem).
This is helpful if your configuration of HTCondor does not do the file transfer for you and
your job is responsible for copying in/out any input/output files that are necessary; a sketch is given below.
~~~
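
If you are in that situation (no automatic file transfer), the wrapper could also copy the results back to a shared area itself. The sketch below is only an illustration: `/full/path/to/shared/output` and the `*.root` output pattern are placeholders for whatever your `config.py` actually produces.
```bash
#!/bin/bash

set -o nounset
set -o errexit

# placeholder: a directory shared between the worker nodes and the submit host
output_dir=/full/path/to/shared/output

denv init /full/path/to/ldmx_pro_v4.2.3.sif
denv fire config.py ${1}

# copy whatever the job produced from the scratch directory to the shared area
# (adjust the pattern to match the output file name config.py constructs)
cp *.root "${output_dir}/"
```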

~~~admonish note title="Comments"
- Similar to Slurm's `--array`, we are relying on HTCondor's `-queue` command to decide what run numbers to use. Look at HTCondor's documentation (for example [Submitting many similar jobs with one queue command](https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#submitting-many-similar-jobs-with-one-queue-command)) for more information; one variant is sketched below.
~~~
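
As one concrete variant, the run numbers could be listed explicitly with an in-file `queue ... from` statement instead of `-queue 10` on the command line; `run_numbers.txt` here is a made-up file with one run number per line.
```
# at the end of submit.sub, replacing the -queue command-line option
arguments = "$(run_number)"
queue run_number from run_numbers.txt
```
Defining `run_number` this way also lets you use `$(run_number)` in the `output` file name.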
