src/using/batch.md
This is plenty small enough to include in your `${HOME}` directory on most if not all clusters.
Additionally, most clusters share your `${HOME}` directory with the worker nodes, so you don't even need to copy `denv` to where the jobs are being run.
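As a quick check (a minimal sketch, assuming `denv` was installed into a directory on your `${PATH}` under `${HOME}`, e.g. `~/.local/bin`), confirm that the shell your jobs will use can find both `denv` and the container runner:
```shell
# both commands should print a path; if not, adjust your ${PATH} or re-install into ${HOME}
command -v denv
command -v apptainer
```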
## Preparing for Batch Running
The above instructions have you set up to run `denv` on the cluster just like you run `denv` on your own computer;
however, a few more steps are helpful to ensure that the batch jobs run reliably and efficiently.
### Pre-Building SIF Images
Under the hood, `apptainer` runs images from SIF files.
Pre-building the SIF file once in a directory with plenty of space means each job does not have to re-download and convert the image itself.
```shell
cd path/to/big/dir
apptainer build ldmx_pro_v4.2.3.sif docker://ldmx/pro:v4.2.3 # just an example, name the SIF file appropriately
```
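The sections below cover how jobs actually use this file; the common ingredient is pointing `denv` at the pre-built SIF by its full path rather than at the `ldmx/pro:v4.2.3` tag. A small sketch (the directory names are just the examples used above):
```shell
# set up a denv backed by the pre-built SIF file
denv init /full/path/to/big/dir/ldmx_pro_v4.2.3.sif
```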

### Running the SIF Image
How we run the image during a job depends on how the jobs are configured.
For the clusters I have access to (UMN and SLAC), there are two different ways for jobs to be configured
that mainly differ in _where_ the job is run.

~~~admonish success title="Check Where Jobs are Run"
A good way to determine this (and learn about the batch job system you want to use)
is to figure out how to run a job that just runs `pwd`.
This command prints the "present working directory", so you can see where
the job is being run from.

Refer to your cluster's IT staff, its documentation, and the batch job system's documentation to
learn how to do this.
~~~
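For example, with Slurm this check can be as simple as the following sketch (your cluster may require extra options such as an account or partition):
```shell
# submit a job whose only work is printing its working directory
sbatch --wrap "pwd"
# once it finishes, the directory is in the job's output file (slurm-<jobid>.out by default)
cat slurm-*.out
```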

#### Jobs Run in Submitted Directory
At SLAC S3DF, the jobs submitted with `sbatch` are run from the directory where `sbatch` was run.
This makes it rather easy to run jobs.
We can create a denv and then submit a job running `denv` from within that directory.
Submitting the job then looks like `sbatch <job-options> submit.sh`.

For example, submitting jobs for a range of run numbers would look like
```shell
mkdir log # the SBATCH directives in submit.sh put the log files here
sbatch --array=0-10 submit.sh
```
with
```bash
#!/bin/bash
#SBATCH --job-name my-job
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2g
#SBATCH --time=04:00:00 # time limit for jobs
#SBATCH --output=log/%A-%a.log
#SBATCH --error=log/%A-%a.log

set -o errexit
set -o nounset

# assume the configuration script config.py takes one argument:
# the run number it should use for the simulation,
# from which it uniquely constructs the path of the output file
denv fire config.py ${SLURM_ARRAY_TASK_ID}
# fire is run inside ldmx/pro:v4.2.3 IF SUBMITTED FROM batch/submit/dir
```
Look at the SLAC S3DF and Slurm documentation to learn more about configuring the batch jobs themselves.

~~~admonish note title="Comments"
- _Technically_, since SLAC S3DF's `${SCRATCH}` directory is also shared across the worker nodes, you do not need to pre-build the image. However, this is not advised: the `${SCRATCH}` directory is periodically cleaned, and if that happens while your jobs are running, the cached SIF image would be lost and your jobs could fail in confusing ways.
- Some clusters configure Slurm to limit the number of jobs you can submit at once with `--array`. This means you might need to submit the jobs in "chunks" and add an offset to `SLURM_ARRAY_TASK_ID` so that the different "chunks" have different run numbers. This can be done with bash's math syntax, e.g. `$(( SLURM_ARRAY_TASK_ID + 100 ))` (see the sketch just after this note).
~~~

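As a sketch of that chunking approach, `RUN_OFFSET` below is just an illustrative variable name (not something Slurm or `denv` defines) that `submit.sh` would add to `SLURM_ARRAY_TASK_ID` when picking the run number:
```shell
# first chunk: run numbers 0-99
sbatch --array=0-99 --export=ALL,RUN_OFFSET=0 submit.sh
# second chunk: run numbers 100-199
sbatch --array=0-99 --export=ALL,RUN_OFFSET=100 submit.sh
```
Inside `submit.sh`, the run number would then be `$(( SLURM_ARRAY_TASK_ID + RUN_OFFSET ))`.
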
#### Jobs Run in Scratch Directory
At UMN's CMS cluster, the jobs submitted with `condor_submit` are run from a newly-created scratch directory.
This makes it slightly more difficult to inform `denv` of the configuration we want to use.
`denv` has an experimental shebang syntax that could be helpful for this purpose.

`prod.sh`
```bash
#!/full/path/to/denv shebang
#!denv_image=/full/path/to/ldmx_pro_v4.2.3.sif
#!bash

set -o nounset
set -o errexit

# everything here is run in `bash` inside ldmx/pro:v4.2.3
# assume run number is provided as an argument
fire config.py ${1}
```
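Before submitting, the shebang wiring can be checked by running the script directly on the login node (assuming it has been made executable; the run number `1` is arbitrary):
```shell
chmod +x prod.sh
./prod.sh 1   # runs `fire config.py 1` inside ldmx/pro:v4.2.3 via denv
```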

This script is then used with the submit file `submit.sub` in the same directory.
```
# run prod.sh and transfer it to scratch area
executable = prod.sh
transfer_executable = yes

# terminal and condor output log files
# helpful for debugging at slight performance cost
output = $(Cluster)-$(Process).out
error = $(Cluster)-$(Process).err
log = $(Cluster).log
# pass a run number to prod.sh (read there as ${1}); 10 jobs here is just an example count
arguments = $(Process)
queue 10
```
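Submitting and monitoring then uses the standard HTCondor commands (assuming your cluster lets you call them directly):
```shell
condor_submit submit.sub   # submit the jobs described above
condor_q                   # check on their progress
```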
Alternatively, one could write a script _around_ `denv` like
```shell
#!/bin/bash

set -o nounset
set -o errexit

# stuff here is run outside ldmx/pro:v4.2.3
# need to call `denv` to go into image
denv init /full/path/to/ldmx_pro_v4.2.3.sif
denv fire config.py ${1}
```
The `denv init` call writes a few small files which shouldn't have a large impact on performance
(but could if the directory in which the job is being run has a slow filesystem).
This is helpful if your configuration of HTCondor does not do the file transfer for you and
your job is responsible for copying in/out any input/output files that are necessary.

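For example, a wrapper that handles its own file transfer might look like the following sketch; the output file name and destination directory are placeholders rather than anything `denv`, `fire`, or HTCondor defines:
```shell
#!/bin/bash
set -o nounset
set -o errexit

# run the simulation inside the image, as in the script above
denv init /full/path/to/ldmx_pro_v4.2.3.sif
denv fire config.py ${1}

# HTCondor is not transferring files for us in this setup, so copy the
# (placeholder-named) output back to shared storage ourselves
cp events_run_${1}.root /full/path/to/output/dir/
```
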
~~~admonish note title="Comments"
- Similar to Slurm's `--array`, we are relying on HTCondor's `-queue` command to decide what run numbers to use. Look at HTCondor's documentation (for example [Submitting many similar jobs with one queue command](https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#submitting-many-similar-jobs-with-one-queue-command)) for more information.
~~~