# Batch Computing

The academic clusters that we have access to mostly have `apptainer` installed, which we can use to run the images with ldmx-sw built into them.
We use `denv` when running the images manually and, fortunately, it is small enough to deploy onto the clusters as well.[^1]
```shell
# on the cluster where you want to run batch jobs
curl -s https://tomeichlersmith.github.io/denv/install | sh
```
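
If the installer finishes without error, `denv` should end up on your `PATH`
(it typically lands in `~/.local/bin`; add that to your `PATH` if the command below prints nothing).
A quick check:
```shell
# print where the denv executable landed
command -v denv
```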

~~~admonish tip title="Image Storage"
While your `${HOME}` directory is large enough to hold the installation of `denv`,
it is usually much too small to hold copies of the images that we want to run.
For this reason, you will likely want to edit your shell configuration (e.g. `~/.bashrc`)
to change where `apptainer` will store the images.
Refer to your cluster's IT help or documentation to find a suitable place to hold these images.
For example, [the S3DF cluster at SLAC](https://s3df.slac.stanford.edu/#/reference?id=apptainer)
suggests using the `${SCRATCH}` variable they define for their users.
```shell
export APPTAINER_LOCALCACHEDIR=${SCRATCH}/.apptainer
export APPTAINER_CACHEDIR=${SCRATCH}/.apptainer
export APPTAINER_TMPDIR=${SCRATCH}/.apptainer
```
~~~
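
After re-sourcing your shell configuration, it is worth double checking that `apptainer` picked up the new location.
The snippet below is a quick sanity check using the S3DF `${SCRATCH}` example from above; adjust the path for your cluster.
```shell
source ~/.bashrc
echo ${APPTAINER_CACHEDIR} # should print the new cache location
apptainer cache list       # operates on the configured cache directory
```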

~~~admonish success title="Test"
With `denv` installed on the cluster, you should be able to run `denv` manually like normal.
For example, you can do a test run with a light image that is fast to download.
```shell
denv init alpine:latest
denv cat /etc/os-release
# should say "Alpine" instead of the host OS
```
~~~

[^1]: The total disk footprint of a `denv` installation is 120KB.
This is plenty small enough to include in your `${HOME}` directory on most, if not all, clusters.
Additionally, most clusters share your `${HOME}` directory with the worker nodes, so you don't even need to bother copying `denv` to where the jobs are being run.

## Preparing for Batch Running
The above instructions get you set up to run `denv` on the cluster just like you run `denv` on your own computer;
however, a few more steps are helpful to ensure that the batch jobs run reliably and efficiently.

### Pre-Building SIF Images
Under the hood, `apptainer` runs images from SIF files.
When `denv` runs using an image tag (e.g. `ldmx/pro:v4.2.3`), `apptainer` stores a copy of this image in a SIF file inside the cache directory.
While the cache directory is shared with the worker nodes on some clusters, it is not on all of them, so pre-building the image ourselves into a known location is helpful.

The location for the image should be big enough to hold the multi-GB image (so probably not your `${HOME}` directory) _and_ needs to be shared with the computers that run the jobs.
Again, check with your IT or cluster documentation to find a suitable location.
At SLAC's S3DF, `/sdf/group/ldmx` can be a good location (and may already have the image you need built!).
```shell
cd path/to/big/dir
apptainer build ldmx_pro_v4.2.3.sif docker://ldmx/pro:v4.2.3 # just an example, name the SIF file appropriately
```
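
Once the build finishes, it is worth a quick smoke test that the SIF file actually runs before pointing jobs at it (using the example file name from above):
```shell
# run a trivial command inside the freshly-built image
apptainer exec ldmx_pro_v4.2.3.sif cat /etc/os-release
```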

### Running the SIF Image
How we run the image during the jobs depends on how the jobs are configured.
For the clusters I have access to (UMN and SLAC), there are two different ways for jobs to be configured,
which mainly differ in _where_ the job is run.

#### Jobs Run in Submitted Directory
At SLAC S3DF, jobs submitted with `sbatch` are run from the directory where `sbatch` was run.
This makes it rather easy to run jobs:
we can create a denv in the submission directory and then submit a job that runs `denv` from within it.
```shell
cd batch/submit/dir
denv init /full/path/to/big/dir/ldmx_pro_v4.2.3.sif
```
Submitting the job would look like `sbatch <job-options> submit.sh` with
```shell
#!/bin/bash
# submit.sh
denv fire config.py # inside ldmx/pro:v4.2.3 IF SUBMITTED FROM batch/submit/dir
```
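
For example, a minimal submission could look like the following, where the partition name and log file are assumptions to be replaced with values appropriate for your cluster and account.
```shell
# hypothetical "shared" partition; %j expands to the job ID
sbatch --partition=shared --output=job_%j.log submit.sh
```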
Look at the SLAC S3DF and Slurm documentation to learn more about configuring the batch jobs themselves.

#### Jobs Run in Scratch Directory
At UMN's CMS cluster, jobs submitted with `condor_submit` are run from a newly-created scratch directory.
This makes it slightly more difficult to inform `denv` of the configuration we want to use.
`denv` has an experimental shebang syntax that could be helpful for this purpose.

```shell
#!/usr/bin/env denv shebang
#!denv_image=/full/path/to/ldmx_pro_v4.2.3.sif
#!bash

# everything here is run in `bash` inside ldmx/pro:v4.2.3
fire config.py
```

And then you would `condor_submit` this script.
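
For reference, a minimal HTCondor submit description for such a script might look like the sketch below, where all of the file names are hypothetical.
```
# job.sub -- minimal HTCondor submit description
# run_fire.sh is the denv-shebang script above (made executable with chmod +x)
executable = run_fire.sh
output     = job.out
error      = job.err
log        = job.log
queue
```
This would be submitted with `condor_submit job.sub`.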
Alternatively, one could write a script _around_ `denv` like
```shell
# stuff here is run outside ldmx/pro:v4.2.3
# need to call `denv` to go into the image
denv init /full/path/to/ldmx_pro_v4.2.3.sif
denv fire config.py
```
The `denv init` call writes a few small files, which shouldn't have a large impact on performance
(but could if the directory in which the job is being run has a slow filesystem).