
Commit 4cc1d33

Merge pull request #50 from LDMX-Software/batch-computing

add batch computing page

2 parents 90eed61 + 10d97cd

3 files changed: +182 -38 lines

src/SUMMARY.md (+1)

@@ -19,6 +19,7 @@
 - [Dark Brem Signal Samples](using/dark-brem/intro.md)
 - [How to Run](using/dark-brem/how-to.md)
 - [Legacy Instructions](using/dark-brem/legacy.md)
+- [Batch Computing](using/batch.md)
 
 # Physics Guides
 - [Statistics and Calculations](physics/stats/intro.md)

src/developing/custom-production-image.md (+3 -38)

@@ -1,13 +1,9 @@
-# Using a Custom Production Image
+# Building a Custom Production Image
 
 Using a container has many advantages and one of them is the ability to develop code on one machine (e.g. your personal laptop),
 but then deploy the _exact same code_ to run on several other computers (e.g. SLAC batch computing).
-This page details a process you can follow to generate your own production image that has your developments of ldmx-sw inside of it and has two big parts.
-
-1. Building a custom production image with your developments
-2. Running any production image (with specific focus on using `singularity`).
-
-## 1. Building a Custom Production Image
+This page details a process you can follow to generate your own production image that has your developments of ldmx-sw inside of it.
+Refer to the [Batch Computing](../using/batch.md) page for how to use these production images.
 
 Building a docker image is complicated, but hopefully you can get one of the two methods listed below to work for your specific case.
 The common denominator in these methods is that you *need* to have a DockerHub repository that you have administrator access to.
@@ -61,34 +57,3 @@ _Note: If you haven't yet, you may need to `docker login` on your computer for t
 ```
 docker push docker-user-name/docker-repo-name:some-tag
 ```
-
-## 2. Running the Production Image on the Batch Computer
-0. Decide where you want to save the production image: `export LDMX_PRODUCTION_IMG=$(pwd -P)/ldmx_my_production_image_some_tag.sif`
-1. Pull the docker container down from Docker Hub and build it into a `.sif` image file. _Note: This step requires your Docker Hub repository to be public._
-```
-singularity build ${LDMX_PRODUCTION_IMG} docker://docker-user-name/docker-repo-name:some-tag
-```
-2. Now you can run a configuration script with your developments in the container using
-```
-singularity run --no-home ${LDMX_PRODUCTION_IMG} . config.py
-```
-This is the command you want to be giving to `bsub` or some other submission program.
-The only files it needs access to are the configuration script that you want to run and the `.sif` image file;
-both of which are only used at the start-up of the container.
-
-_Note: On SLAC computers, the default singularity cache directory is $HOME, but SLAC users are not given very much space in $HOME. It may help your singularity build and run commands if you change the cache directory 'SINGULARITY_CACHEDIR' to somewhere with more space._
-
-## 3. Submission Script
-It is best practice to write a "submission script" that handles the running of this command _and_ any pre- or post-run actions.
-A lot of different submission scripts have been written in `bash` and `python`, but they all have a similar structure:
-1. Setup the batch environment (e.g. find the singularity image file and turn off email notifications)
-2. Configure or write a job script which does all the pre- and post-run actions as well as the `singularity run` command.
-   - Go to a scratch or temporary directory to work
-   - Pre-Run Actions: copying over the input file, inserting parameters into the configuration script, etc.
-   - Run the `singularity run` command
-   - Post-Run Actions: copying output files to the output directory, cleaning up the scratch directory
-3. Submit the job script using the submission program (e.g. `bsub` or `condor`) however many times
-
-The `batch` directory in the [LDMX-Software/ldmx-sw-scripts](https://github.com/LDMX-Software/ldmx-sw-scripts)
-repository offers some examples of these submission scripts, although they tend to be a little old
-and will need to be updated.

src/using/batch.md (new file, +178)

# Batch Computing

The academic clusters that we have access to mostly have `apptainer` installed, which we can use to run the images with ldmx-sw built into them.
We use `denv` when running the images manually and, fortunately, it is small enough to deploy onto the clusters as well.[^1]
```shell
# on the cluster where you want to run batch jobs
curl -s https://tomeichlersmith.github.io/denv/install | sh
```

~~~admonish tip title="Image Storage"
While the `${HOME}` directory is large enough to hold the installation of `denv`,
it is usually much too small to hold copies of the images that we want to run.
For this reason, you will likely want to edit your shell configuration (e.g. `~/.bashrc`)
to change where `apptainer` will store the images.
Refer to your cluster's IT help or documentation to find a suitable place to hold these images.
For example, [the S3DF cluster at SLAC](https://s3df.slac.stanford.edu/#/reference?id=apptainer)
suggests using the `${SCRATCH}` variable they define for their users.
```shell
export APPTAINER_LOCALCACHEDIR=${SCRATCH}/.apptainer
export APPTAINER_CACHEDIR=${SCRATCH}/.apptainer
export APPTAINER_TMPDIR=${SCRATCH}/.apptainer
```
~~~

~~~admonish success title="Test"
With `denv` installed on the cluster, you should be able to run `denv` manually like normal.
For example, you can test run a light image that is fast to download.
```
denv init alpine:latest
denv cat /etc/os-release
# should say "Alpine" instead of the host OS
```
~~~

[^1]: The total disk footprint of a `denv` installation is 120KB.
This is plenty small enough to include in your `${HOME}` directory on most if not all clusters.
Additionally, most clusters share your `${HOME}` directory with the worker nodes, so you don't even need to bother copying `denv` to where the jobs are being run.

## Preparing for Batch Running
The above instructions have you set up to run `denv` on the cluster just like you run `denv` on your own computer;
however, a few more steps are helpful to ensure that the batch jobs run reliably and efficiently.

### Pre-Building SIF Images
Under the hood, `apptainer` runs images from SIF files.
When `denv` runs using the image tag (e.g. `ldmx/pro:v4.2.3`), `apptainer` stores a copy of this image in a SIF file inside the cache directory.
The cache directory is shared with the worker nodes on some clusters but not on all of them, so it is helpful to pre-build the image ourselves into a known location.

The location for the image should be big enough to hold the multi-GB image (so probably not your `${HOME}` directory) _and_ needs to be shared with the computers that run the jobs.
Again, check with your IT or cluster documentation for a precise location.
At SLAC's S3DF, `/sdf/group/ldmx` can be a good location (and may already have the image you need built!).
```
cd path/to/big/dir
apptainer build ldmx_pro_v4.2.3.sif docker://ldmx/pro:v4.2.3 # just an example, name the SIF file appropriately
```
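
If you want to confirm the freshly built SIF file works before handing it to jobs, a quick sanity check
(a sketch only, mirroring the earlier `alpine` test; the paths here are placeholders) could look like
```shell
# point a throw-away denv at the pre-built SIF and confirm the container starts
cd $(mktemp -d)
denv init /full/path/to/big/dir/ldmx_pro_v4.2.3.sif
denv cat /etc/os-release # should report the container's OS rather than the host's
```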

## Running the SIF Image
How we run the image during the jobs depends on how the jobs are configured.
For the clusters I have access to (UMN and SLAC), there are two different ways for jobs to be configured
that mainly change _where_ the job is run.

~~~admonish success title="Check Where Jobs are Run"
A good way to figure this out (and learn about the batch job system that you want to use)
is to figure out how to run a job that just runs `pwd`.
This command prints out the "present working directory", so you can see where
the job is being run from. A minimal sketch of such a test is given below.

Refer to your cluster's IT help, your cluster's documentation, and the batch job system's documentation to
learn how to do this.
~~~
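
For example, a `pwd` test might look like the following (a sketch only; adjust any account or
partition options to whatever your cluster requires).
```shell
# Slurm: wrap the single command in a job and check the log it writes
sbatch --wrap="pwd" --output=where.log

# HTCondor: submit /bin/pwd with a minimal submit description
cat > where.sub <<'EOF'
executable = /bin/pwd
output = where.out
error = where.err
log = where.log
queue
EOF
condor_submit where.sub
```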

#### Jobs Run In Submitted Directory
At SLAC S3DF, jobs submitted with `sbatch` are run from the directory where `sbatch` was run.
This makes it rather easy to run jobs.
We can create a denv and then submit a job running `denv` from within that directory.
```
cd batch/submit/dir
denv init /full/path/to/big/dir/ldmx_pro_v4.2.3.sif
```

For example, submitting jobs for a range of run numbers would look like
```shell
mkdir log # the SBATCH directives in submit.sh put the log files here
sbatch --array=0-10 submit.sh
```
with
```bash
#!/bin/bash
#SBATCH --job-name my-job
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2g
#SBATCH --time=04:00:00 # time limit for jobs
#SBATCH --output=log/%A-%a.log
#SBATCH --error=log/%A-%a.log

set -o errexit
set -o nounset

# assume the configuration script config.py takes one argument:
# the run number it should use for the simulation
# (and uniquely creates the path of the output file from it)
denv fire config.py ${SLURM_ARRAY_TASK_ID}
# fire is run inside ldmx/pro:v4.2.3 IF SUBMITTED FROM batch/submit/dir
```
Look at the SLAC S3DF and Slurm documentation to learn more about configuring the batch jobs themselves.

~~~admonish note title="Comments"
- _Technically_, since SLAC S3DF's `${SCRATCH}` directory is also shared across the worker nodes, you do not need to pre-build the image. However, this is not advised because if the `${SCRATCH}` directory is periodically cleaned during your jobs, the cached SIF image would be lost and your jobs could fail in confusing ways.
- Some clusters configure Slurm to limit the number of jobs you can submit at once with `--array`. This means you might need to submit the jobs in "chunks" and add an offset to `SLURM_ARRAY_TASK_ID` so that the different "chunks" have different run numbers. This can be done with bash's math syntax, e.g. `$(( SLURM_ARRAY_TASK_ID + 100 ))`; see the sketch after this note.
~~~
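
A sketch of that chunking, assuming `submit.sh` is modified to take the offset as its first argument
(an assumption for illustration, not part of the example above), could be
```shell
# each chunk submits 100 array tasks with a different run-number offset
sbatch --array=0-99 submit.sh 0    # run numbers 0-99
sbatch --array=0-99 submit.sh 100  # run numbers 100-199
```
```bash
# inside submit.sh: combine the offset argument with the array index
run_number=$(( SLURM_ARRAY_TASK_ID + ${1} ))
denv fire config.py ${run_number}
```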

#### Jobs Run in Scratch Directory
At UMN's CMS cluster, jobs submitted with `condor_submit` are run from a newly-created scratch directory.
This makes it slightly more difficult to inform `denv` of the configuration we want to use.
`denv` has an experimental shebang syntax that could be helpful for this purpose.

`prod.sh`
```bash
#!/full/path/to/denv shebang
#!denv_image=/full/path/to/ldmx_pro_v4.2.3.sif
#!bash

set -o nounset
set -o errexit

# everything here is run in `bash` inside ldmx/pro:v4.2.3
# assume the run number is provided as an argument
fire config.py ${1}
```

with the submit file `submit.sub` in the same directory.
```
# run prod.sh and transfer it to the scratch area
executable = prod.sh
transfer_executable = yes

# terminal and condor output log files
# helpful for debugging at a slight performance cost
output = logs/$(run_number)-$(Cluster)-$(Process).out
error = $(output)
log = $(Cluster)-condor.log

# "hold" the job if there is a non-zero exit code
# and store the exit code in the hold reason subcode
on_exit_hold = ExitCode != 0
on_exit_hold_subcode = ExitCode
on_exit_hold_reason = "Program exited with non-zero exit code"

# the 'Process' variable is an index for the job in the submission cluster
arguments = "$(Process)"
```
You would then `condor_submit` these jobs with
```shell
condor_submit submit.sub -queue 10
```
```
156+
157+
~~~admonish note collapsible=true title="Alternative Script Design"
158+
Alternatively, one could write a script _around_ `denv` like
159+
```shell
160+
#!/bin/bash
161+
162+
set -o nounset
163+
set -o errexit
164+
165+
# stuff here is run outside ldmx/pro:v4.2.3
166+
# need to call `denv` to go into image
167+
denv init /full/path/to/ldmx_pro_v4.2.3.sif
168+
denv fire config.py ${1}
169+
```
170+
The `denv init` call writes a few small files which shouldn't have a large impact on performance
171+
(but could if the directory in which the job is being run has a slow filesystem).
172+
This is helpful if your configuration of HTCondor does not do the file transfer for you and
173+
your job is responsible for copying in/out any input/output files that are necessary.
174+
~~~
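
For instance, a wrap-around script that also stages its own output might look like the following
(a sketch only; the image path, the output directory, and the assumption that `config.py` writes a
ROOT file into the current scratch directory are all placeholders to adapt).
```bash
#!/bin/bash

set -o nounset
set -o errexit

# somewhere shared and big enough to hold the outputs (placeholder path)
output_dir=/full/path/to/shared/output/dir

# run ldmx-sw inside the image from the job's scratch directory
denv init /full/path/to/ldmx_pro_v4.2.3.sif
denv fire config.py ${1}

# post-run action: copy the output out of the scratch directory
cp *.root "${output_dir}/"
```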

~~~admonish note title="Comments"
- Similar to Slurm's `--array`, we are relying on HTCondor's `-queue` command to decide what run numbers to use. Look at HTCondor's documentation (for example, [Submitting many similar jobs with one queue command](https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html#submitting-many-similar-jobs-with-one-queue-command)) for more information.
~~~
