Serial shared-memory jobs

Serial jobs are the kind of jobs most people in earth observation will probably use. One item (e.g. a satellite scene, a pair of satellite scenes, or a stack of multiple satellite scenes) is processed in one job at a time on one cluster node or a fraction of a node. The software that does the processing (i.e. the processor) may be a single-core or a multi-threaded (i.e. thread-parallel) processor. Overall parallelism and high performance are achieved by running many such jobs/tasks in parallel on the HPC cluster. Because the RAM (memory) of a cluster node is shared among these and other jobs running on the same node, they are called shared-memory jobs. A single shared-memory job is limited to the total amount of memory, cores/hyperthreads and, if available, GPUs of one node.

General Instructions

This is a simple step-by-step recipe for the simplest type of serial shared-memory jobs, illustrating the use of the SLURM commands for users of a bash shell. For a general introduction to SLURM and links to a very detailed documentation, see SLURM Workload Manager. Concrete examples of serial shared-memory job scripts can be found in this chapter.

Resource allocation

Specify the amount of required resources wisely! Do not block resources by requesting more (e.g. cores, RAM, processing time) than you need. Since SLURM enforces a fair share of resources, small jobs (i.e. jobs that request few resources for a short time) are preferred over large jobs (i.e. jobs that request many resources for a long time) when scheduling the queue.
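If you are unsure how much memory or run time a job really needs, one option is to check the accounting data of a finished test job and size subsequent requests accordingly. A minimal sketch (65648 is the example job ID used further below; adjust the format fields to your needs):

# show run time, peak memory and CPU time of a completed job
sacct --clusters=hpda2 -j 65648 --format=JobID,Elapsed,MaxRSS,TotalCPU,State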

Submission of a job

Step 1: Log into login node

Step 2: Edit a job script

The following script is assumed to be stored in the file myjob.cmd. For a more complete list of possible SLURM-script variables see SLURM Workload Manager.

#!/bin/bash
#SBATCH -J <job_name>
(Placeholder) Name of the job (not more than 10 characters, please).
#SBATCH -o ./%x.%j.%N.out
(Placeholder) Standard output goes there. Note that the directory in which the output file is placed must exist before the job starts, and the full path name must be specified (no environment variable!). You can choose any name you like. If you prefer automatic naming, %x encodes the job name and %j the job ID into the file name; %N encodes the master node of the job and can be added if job IDs from different SLURM clusters might be the same. Here, the specified path is relative to the directory given in the -D spec.
#SBATCH -e ./%x.%j.%N.err
(Placeholder) Standard error goes there. The same remarks as for the -o option apply.
#SBATCH -D ./
Directory used by the script as starting point (working directory). The directory specified must exist. Here, the path is relative to the submission directory.
#SBATCH --get-user-env
Set the user environment properly.
#SBATCH --clusters=hpda2
#SBATCH --partition=<cluster partition>
Define the cluster and partition to run the job on. See Available HPC-cluster partitions.
#SBATCH --cpus-per-task=<number of hyperthreads>
Number of hyperthreads per job. The minimum is --cpus-per-task=2 (2 hyperthreads ≡ 1 physical core), the maximum is --cpus-per-task=160 (160 hyperthreads ≡ 80 physical cores ≡ 1 full node). The setting for a single-core job is --cpus-per-task=2.
#SBATCH --gres=gpu:<number of GPUs>
Number of GPUs per job. Only available on the hpda2_compute_gpu partition.
#SBATCH --mem=<e.g. 500mb, 50gb, ...>
Maximum memory the job may use. Very large values can cause additional cores to be assigned to the job that then remain unused, so this feature should be used with care.
#SBATCH --mail-type=end
Send an e-mail at job completion.
#SBATCH --mail-user=<email_address>@<domain>
(Placeholder) E-mail address (don't forget, and please enter a valid address!).
#SBATCH --export=NONE
Do not export the environment of the submitting shell into the job. While SLURM also allows ALL here, this is strongly discouraged, because the submission environment is very likely to be inconsistent with the environment required for execution of the job.
#SBATCH --time=08:00:00
Maximum run time; 8 hours 0 minutes 0 seconds in this example. The limit may be increased up to the queue limit.
#SBATCH --account=<your account>
If you belong to a terrabyte project, you must specify your group account (which is identical to the project ID) in order to benefit from a higher priority in the SLURM queue, e.g. --account=pn56su. DLR users who got their account through the self-registration portal are in the group hpda-c by default (lowest priority class), but may specify a project ID with a higher priority class here as well.
module load slurm_setup
First executed line: SLURM settings necessary for the proper setup of the batch environment.
module load ...
Load any required environment modules (usually needed if the program is linked against shared libraries, or if paths to applications are needed). The "..." is of course a placeholder.

./my_serial_prog.exe

python /path/to/python/script.py

ch-run -b /dss/.:/dss/ -w /path/to/container/image -- python /path/to/python/script.py

Start an executable, a script, or a script inside a software container (see examples).

This script essentially looks like a bash script. However, there are specially marked comment lines ("control sequences"), which have a special meaning in the SLURM context. The entries marked "Placeholder" must be suitably modified to have valid user-specific values.

For this script, the environment of the submitting shell will not be exported to the job's environment. The latter is completely set up via the module system inside the script.

Step 3: Submission procedure

The job script is submitted to the queue via the command

sbatch path/to/myjob.cmd

At submission time the control sequences are evaluated and stored in the queuing database, and the script is copied into an internal directory for later execution. If the command was executed successfully, the Job ID will be returned as follows:

Submitted batch job 65648.

It is a good idea to note down your job IDs, for example to provide them to the LRZ support if anything goes wrong. The submission command can also contain control sequences. For example,

sbatch --time=12:00:00 myjob.cmd

would override the setting inside the script, forcing it to run 12 instead of 8 hours.
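If you want to record job IDs automatically, for example in a wrapper script, the --parsable option of sbatch prints only the job ID (followed by the cluster name when --clusters is used) instead of the full message. A minimal sketch (the logfile name is just an example):

jobid=$(sbatch --parsable path/to/myjob.cmd)     # e.g. "65648;hpda2" when --clusters is set in the script
echo "$(date '+%F %T') ${jobid}" >> submitted_jobs.log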

Step 4: Checking the status of a job

Once submitted, the job will be queued for some time, depending on how many jobs are presently in the queue. Once enough previously submitted jobs have completed, the job is started on one or more nodes, as determined by its resource requirements. The status of the current jobs of a user can be queried with the squeue --clusters=hpda2 --user=<your user ID> command. The output may look similar to this:

CLUSTER: hpda2

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

            100120 hpda2_com  D027_01  di82rij  R 1-03:19:49      1 hpdar03c02s07

The state "P" indicates that the job is queued (pending for execution). Once the job is running, the output would indicate the state to be "R" (=running), and would also list the host(s) it was running on. For jobs that have not yet started, the --start option, applied to squeue, will provide an estimate (!) for the starting time. The sinfo --clusters=[all | cluster_name] command prints out an overview of the status of all clusters or a particular clusters in the SLURM configuration.

Inspection and modification of jobs

Queued jobs can be inspected for their characteristics via the command

scontrol --clusters=<cluster_name> show jobid=<job ID>

which will print out a list of "Keyword=Value" pairs which characterize the job. As long as a job is waiting in the queue, it is possible to modify at least some of these; for example, the command

scontrol --clusters=<cluster_name> update jobid=65648 TimeLimit=04:00:00

would change the run time limit of the above-mentioned example job from 8 hours to 4 hours.

Deleting jobs from the queue

To forcibly remove a job from SLURM, the command

scancel --clusters=<cluster_name> <JOB_ID>

can be used. Please do not forget to specify the cluster! The scancel(1) man page provides further information on the use of this command.
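If you need to remove several jobs at once, scancel also accepts a user filter. Note that the following sketch removes all of your jobs on the given cluster, so use it with care:

scancel --clusters=<cluster_name> --user=$USER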

Example serial shared-memory job scripts

Single core jobs

This job type normally uses a single core on a shared-memory node of the designated SLURM partition. The remaining cores of the node are shared with other users.

Simple single core job

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --cpus-per-task=2

#SBATCH --mem=40gb

#SBATCH --mail-type=all

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=01:00:00

#SBATCH --account=hpda-c

module load slurm_setup

./myprog.exe

The above example requests 1 physical core (i.e. 2 hyperthreads) and 40 GB RAM of a node for 1 hour and assumes the binary is located in the submission directory (-D). The user is notified by mail in all cases: when the job begins, ends or fails.

Single core job using a container

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --cpus-per-task=2

#SBATCH --mem=500mb

#SBATCH --mail-type=end

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=00:10:00

#SBATCH --account=hpda-c

module load slurm_setup

module load charliecloud

ch-run -b /dss/.:/dss/ -w /path/to/container/image -- python /path/to/python/script.py

The above example requests 1 physical core (i.e. 2 hyperthreads) and 500 MB RAM of a node for 10 minutes and executes a python script in a Charliecloud container that has the DSS mounted under /dss. The user is notified by mail when the job ends.

Multi-threaded (thread-parallel) jobs

This job type uses a single shared memory node of the designated SLURM partition. Parallelization can be achieved either via custom thread-programming (e.g. in a python or bash script) or OpenMP programming.
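As an illustration of the script-level variant, the following minimal sketch distributes a set of input files over several worker processes with xargs. The input pattern and the helper script process_one.sh are hypothetical; the worker count is derived from the allocated hyperthreads (SLURM_CPUS_PER_TASK is set by the --cpus-per-task directive):

# one worker per physical core (2 hyperthreads per core); process_one.sh is a hypothetical helper
N_WORKERS=$((SLURM_CPUS_PER_TASK / 2))
ls /path/to/inputs/*.tif | xargs -P "$N_WORKERS" -I{} bash ./process_one.sh {}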

Please note that these scripts are usually not useful for MPI applications; scripts for such programs are given in the section Parallel distributed memory jobs (MPI).

Simple thread-parallel job

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --cpus-per-task=40

#SBATCH --mem=40gb

#SBATCH --mail-type=end

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=01:00:00

#SBATCH --account=hpda-c

module load slurm_setup

./myprog.exe

The above example requests 40 hyperthreads (i.e. 20 physical cores) and 40 GB RAM of a node for 1 hour and assumes the binary is located in the submission directory (-D). The user is notified by mail when the job ends.

Thread-parallel job using a container

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --cpus-per-task=16

#SBATCH --mem=50gb

#SBATCH --mail-type=fail

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=10:00:00

#SBATCH --account=hpda-c

module load slurm_setup

module load charliecloud

ch-run -b /dss/.:/dss/ -w /path/to/container/image -- bash /path/to/bash/multi_core_script.sh

The above example requests 16 hyperthreads (i.e. 8 physical cores) and 50 GB RAM on a node for 10 hours and executes a bash script in a Charliecloud container that has the DSS mounted under /dss. The user is notified by mail only if the job fails.

Thread-parallel job using OpenMP

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --cpus-per-task=16

#SBATCH --mem=10gb

#SBATCH --mail-type=all

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=00:10:00

#SBATCH --account=hpda-c

module load slurm_setup

export OMP_NUM_THREADS=8 # use 8 threads, one per physical core

./my_openmp_program.exe

The above example requests 16 hyperthreads (i.e. 8 physical cores) and 10 GB RAM of a node for 10 minutes and assumes the binary is located in the submission directory (-D). The user is notified by mail in all cases: when the job begins, ends or fails.
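Instead of hard-coding the thread count, it can be derived from the allocation so that the script stays consistent when --cpus-per-task changes. A minimal sketch, assuming one OpenMP thread per physical core (i.e. half of the allocated hyperthreads):

# SLURM_CPUS_PER_TASK is set by the --cpus-per-task directive
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK / 2))
./my_openmp_program.exe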

Serial jobs on the GPU-cluster

Here we show examples for jobs on the GPU cluster. The general remarks for the CPU examples above also apply here.
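Before starting a long computation, it can be useful to confirm inside the job which GPUs have actually been allocated. A minimal sketch that could be added at the top of a GPU job script, assuming nvidia-smi is available on the GPU nodes and that SLURM sets CUDA_VISIBLE_DEVICES for --gres=gpu jobs:

echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES:-none}"   # typically set by SLURM for --gres=gpu jobs
nvidia-smi                                             # list the visible GPUs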

Simple 1 GPU job

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute_gpu

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=1

#SBATCH --gres=gpu:1

#SBATCH --mail-type=end

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=01:00:00

#SBATCH --account=hpda-c

module load slurm_setup

module load python/3.7.11-extended

python /path/to/mypythonscript.py

The above example requests 1 GPU with 80 GB RAM on 1 node for 1 hour, loads a python module and executes a python script. The user is notified by mail when the job ends.

1 GPU job using a conda environment

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute_gpu

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=1

#SBATCH --gres=gpu:1

#SBATCH --cpus-per-task=12

#SBATCH --mem=50gb

#SBATCH --mail-type=fail

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=10:00:00

#SBATCH --account=hpda-c

module load slurm_setup  

module load miniconda3

eval "$(conda shell.bash hook)"

conda activate myenv

python /path/to/mypythonscript.py

The above example requests 1 GPU with 80 GB RAM on 1 node for 10 hours, as well as 6 physical CPU cores (i.e. 12 hyperthreads) with 50 GB RAM (on the CPU part of the node). It loads a custom conda environment and executes a python script within this environment. The user is notified by mail if the job fails.

Multi-GPU job using a container

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute_gpu

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=1

#SBATCH --gres=gpu:2

#SBATCH --mail-type=fail

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=05:00:00

#SBATCH --account=hpda-c

module load slurm_setup  

module load charliecloud

ch-run -b /dss/.:/dss/ -w /path/to/container/image -- python /path/to/mypythonscript.py

The above example requests a total of 2 GPUs with 80 GB RAM on 1 node for 5 hours and executes a python script in a Charliecloud container that has the DSS mounted under /dss. The user is notified by mail only if the job fails.

Run multiple serial jobs/tasks in parallel

In case of serial shared memory jobs, parallelism can be achieved by

a. multi-threading (i.e. running a multi-threaded program/script in one single job, see the example above),

b. job farming (running multiple serial tasks/jobs in parallel in one SLURM job),

c. job arrays (submitting multiple SLURM jobs to the scheduler with a single job script and letting them run in parallel), or

d. a scheduler script that sequentially generates a job script for every task/job and submits it to SLURM.

Job Farming (starting multiple serial tasks/jobs in one SLURM-Job)

The example job scripts illustrate how to start up multiple serial tasks/jobs within a shared memory parallel SLURM script, with each task/job taking different input data (specified by the iterator $i).

Warning

Please use this with care! If the serial tasks/jobs are imbalanced with respect to run time, this usage pattern can waste CPU resources. Strongly unbalanced jobs may be removed forcibly.

Be aware that, since many tasks are bundled into a single job, the resources requested by the job at one time may be very large. Hence, depending on the number of requested nodes/resources and the load on the cluster, it may take some time until your job is executed. Please also consider that there is only one logfile for stdout and one for stderr, containing the output of all parallel processes. Logfiles may therefore get very large, and it may be tricky to attribute failures to individual tasks/jobs if something goes wrong. Depending on your use case and the load on the cluster, it may be better to run a separate (small) job for each task in parallel (see the examples below).
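One way to keep the output of the individual tasks apart is to redirect each task to its own logfile inside the farming loop. A minimal sketch for the loop body used in the examples below (the logfile naming is just an example):

$MYPROG input_${i} > task_${i}.log 2>&1 &   # one logfile per task instead of one shared logfile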

Simple Job Farming

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o ./%x.%j.%N.out

#SBATCH -e ./%x.%j.%N.err

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --nodes=2

#SBATCH --ntasks-per-node=40

#SBATCH --cpus-per-task=4

#SBATCH --mail-type=end

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=01:00:00

#SBATCH --account=hpda-c

module load slurm_setup

MYPROG=path/to/myprog.exe

# Start as many background serial jobs as the total number of tasks per job

for ((i=1; i<=$SLURM_NTASKS; i++)); do

$MYPROG input_${i} &

done

wait # for completion of background tasks

The above example requests 2 nodes for 1 hour and assumes the binary is located in the submission directory (-D). It runs 80 (2 * 40) tasks in parallel on the two nodes. Each task gets 4 hyperthreads (i.e. 2 physical cores). A task consists of running myprog.exe with an input that varies with $i. The user is notified by mail when the job ends.
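Note that background processes started with a plain & run on the node that executes the batch script, i.e. the first node of the allocation. To actually spread the tasks over both allocated nodes, one option is to launch each task through srun inside the loop. A minimal sketch for the loop body; the exact flags (--exact on recent SLURM versions, --exclusive on older ones) may need adjusting to the cluster's SLURM setup:

srun --nodes=1 --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --exact $MYPROG input_${i} &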

Job Farming using a container

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o ./%x.%j.%N.out

#SBATCH -e ./%x.%j.%N.err

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --nodes=2

#SBATCH --ntasks-per-node=40

#SBATCH --cpus-per-task=4

#SBATCH --mail-type=end

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=01:00:00

#SBATCH --account=hpda-c

module load slurm_setup

module load charliecloud

# Start as many background serial jobs as the total number of tasks of the job

for ((i=1; i<=$SLURM_NTASKS; i++)); do

ch-run -b /dss/.:/dss/ -w /path/to/container/image -- bash /path/to/bash/myscript.sh input_$i &

done

wait # for completion of background tasks

The example is the same as above, but here a task consists of running a bash script in a Charliecloud container that has the DSS mounted under /dss.

Job Arrays (start a separate SLURM job for every task and run them in parallel)

Job arrays offer a mechanism for submitting and managing collections of similar tasks/jobs quickly and easily. An example may be the application of a processor with the same configurations on a huge set of satellite scenes. Job arrays with many tasks/jobs can be submitted in milliseconds. All jobs must have the same initial options (e.g. size, time limit, etc.). Job arrays will have additional environment variables set. The example job scripts illustrate how to start up multiple serial tasks/jobs within a shared memory parallel SLURM script, with each task/job taking different input data (specified by the iterator $SLURM_ARRAY_TASK_ID). Note that in case of job-arrays, you can optionally use "%A" and "%a" in the naming of the standard output- and standard error-logs, in order to include the job ID (%A) and the array index (%a, same as $SLURM_ARRAY_TASK_ID).
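If the inputs are listed in a text file, the array index can be mapped to one line of that list inside the job script. A minimal sketch (the list file name is just an example):

# pick the input that corresponds to this array task
input=$(sed -n "${SLURM_ARRAY_TASK_ID}p" /path/to/filelist.txt)
./myprog.exe "$input"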

Resource allocation

The maximum number of jobs in the queue per user is 10,000. The maximum size of a job array is 1,000. Each task of a job array counts as one job, even though the tasks do not occupy separate job records until they are modified or initiated. If you need more jobs to be executed at a time, have a look at Job Farming (see above). Please be aware that with very high numbers of jobs, small errors that would normally pass unnoticed (the cluster can deal with them) are multiplied by the number of jobs, which may add up to a big error that degrades the cluster's performance. It is therefore a good idea to submit smaller jobs first in order to check that everything works as expected.
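If you want to limit how many array tasks run at the same time, the throttle syntax of the --array specification can be used. A minimal sketch in which at most 10 of the 80 tasks run simultaneously:

#SBATCH --array=1-80%10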

Simple Job Array

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o ./%A_%a.out

#SBATCH -e ./%A_%a.err

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --nodes=1

#SBATCH --cpus-per-task=4

#SBATCH --mem=25gb

#SBATCH --array=1-80

#SBATCH --mail-type=end

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=01:00:00

#SBATCH --account=hpda-c

module load slurm_setup

MYPROG=path/to/myprog.exe

$MYPROG input_${SLURM_ARRAY_TASK_ID}

The above example instantly submits 80 jobs to the SLURM scheduler. Each job has a walltime of 1 hour and gets 4 hyperthreads (i.e. 2 physical cores) and 25 GB RAM. A task consists of running myprog.exe with an input that varies with SLURM_ARRAY_TASK_ID (a number from 1 to 80). The user is notified by mail when the job ends.

Job Array using a container

#!/bin/bash

#SBATCH -J example_jobname

#SBATCH -o ./%A_%a.out

#SBATCH -e ./%A_%a.err

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --clusters=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --nodes=1

#SBATCH --cpus-per-task=4

#SBATCH --mem=25gb

#SBATCH --array=1-80

#SBATCH --mail-type=end

#SBATCH --mail-user=example.user@dlr.de

#SBATCH --export=NONE

#SBATCH --time=01:00:00

#SBATCH --account=hpda-c

module load slurm_setup

module load charliecloud

ch-run -b /dss/.:/dss/ -w /path/to/container/image -- bash /path/to/bash/myscript.sh input_${SLURM_ARRAY_TASK_ID}

The example is the same as above, but here a task consists of running a bash script in a Charliecloud container that has the DSS mounted under /dss.

Scheduling scripts for sequential job submission

The example scripts show one possible way to sequentially submit jobs to the SLURM queue and to achieve parallelism on two levels: the job level (submit and run many SLURM jobs in parallel) and the task level (run many tasks in parallel within one SLURM job). In the example, a list of zip files is taken as input, and the task is to unzip each file. The scheduling script lets you define how many files are unzipped in one job (n_per_job=...) and how many of these files are unzipped in parallel (parallel_runs=...). The script also keeps the number of queued jobs on the cluster within a predefined limit and creates a job folder for each unzipping job, into which all processing logs and the SLURM scripts are written. The unzip script not only unzips the files in parallel, it also checks whether unzipping went well and writes successful as well as failed files to separate lists.

There are countless ways to organize and code workflows on HPC; this is just an appetizer. You can modify the example according to your needs, combine it with the other solutions presented above, or find completely different, better and more sophisticated solutions that fit your personal needs and preferences.

schedule_unzipping.sh

#!/bin/bash

# Folders and paths
#########################

script_folder=/dss/dsshome1/lxc0F/di46riq/Scripts/unzip_Sentinel
# folder where program scripts are located

job_dir=/dss/dssfs02/pn56su/pn56su-dss-0007/zip_example/SLURM/unzip
# folder to which job files will be written

input_folder=/dss/dssfs02/pn56su/pn56su-dss-0007/zip_example
# main folder on DSS where Sentinel data are located

input_list=/dss/dssfs02/pn56su/pn56su-dss-0007/zip_example/index.csv
# path to list with paths of input zip files relative to $input_folder

# Processing variables
###########################

# general settings
n_per_job=10
# number of Sentinel zip files to unzip in one job

# SLURM-Settings
cluster="hpda2"
# HPC cluster on which to run program (hpda2, cm2_tiny or serial)

cpus=2
# number of Hyperthreads to be used per job

parallel_runs=1
# number of runs to be executed in parallel
# (1 = no parallelism, single-core jobs shall be executed on cluster serial!)

walltime="00:30:00"
# walltime of one job (hh:mm:ss)

mail_adr="mymailadress@dlr.de"
# mail adress where to send notifications to

mail_type="all"
# Batch system sends e-mail when [starting|ending|aborting|requeuing] job.
# Options: [begin|end|fail|requeue|all|none]

account="hpda-c"

################################################################################################################################################
################################################################################################################################################
# get partition of cluster
# define maximum number of submitted jobs in queue depending on selected cluster.
if [[ $cluster = "cm2_tiny" ]];
then
partition="cm2_tiny"
max_queue=50
elif [[ $cluster = "serial" ]];
then
partition="serial_std"
max_queue=250
elif [[ $cluster = "hpda2" ]];
then
partition="hpda2_compute"
max_queue=5000
fi

# get number of .SAFE files to process
n_to_process=`wc -l ${input_list} | awk '{ print $1 }'`

if [ $n_to_process -eq "0" ];
then
echo "################## List does not contain Sentinel-2 .SAFE files! Check list! ##################"
else
echo "################## Start job-scheduling for $n_to_process Sentinel-2 .SAFE files #########################"

timestamp=$(date +%s)
# get time stamp

mkdir -p $job_dir # make folder where temp. job files will be stored if not exists
jobbundle=${job_dir}/${timestamp} # make jobbundle folder where singlejob-files will be stored

mkdir -p $jobbundle

# get number of jobs to submit
n_jobs_tmp=$(($n_to_process/$n_per_job))
n_to_process_tmp=$(($n_jobs_tmp*$n_per_job))
if [ $n_to_process_tmp -lt $n_to_process ];
then
n_jobs=$(($n_jobs_tmp+1))
else
n_jobs=$n_jobs_tmp
fi

echo "########## $n_jobs jobs will be submitted to the $cluster cluster ##########"

u=1
start=1
end=$n_per_job

# loop over jobs
while [ $u -le $n_jobs ];
do
# keep number of jobs in queue within cluster limit
n_queue=$(squeue --clusters=${cluster} | wc -l)
n_queue=$(expr $n_queue - 2)

while [ $n_queue -ge $max_queue ];
do
sleep 5s
n_queue=$(squeue --clusters=${cluster} | wc -l)
n_queue=$(expr $n_queue - 2)
done

# make number with leading zeros
u_long=$(printf "%05d" $u)

#generate job name
jobname=${timestamp}_${u_long} #

# create folder for single job and enter it
jobfolder=${jobbundle}/${jobname}
mkdir -p $jobfolder
cd $jobfolder

# make sub-list of files to process within job
if [ $u -eq $n_jobs ];
then
start_old=$start
end_old=$n_to_process
else
start_old=$start
end_old=$end

start=$(($start+$n_per_job))
end=$(($end+$n_per_job))
fi
sed -n "$start_old,$end_old p" ${input_list} > ${jobname}.csv

# create SLURM-Jobfile
echo "################# Create Job ${jobname} #####################"
echo "#!/bin/bash" | tee -a ${jobname}.cmd
echo "#SBATCH -J ${jobname}" | tee -a ${jobname}.cmd
echo "#SBATCH -o ${jobfolder}/${jobname}.out" | tee -a ${jobname}.cmd
echo "#SBATCH -e ${jobfolder}/${jobname}.err" | tee -a ${jobname}.cmd
echo "#SBATCH -D ./" | tee -a ${jobname}.cmd
echo "#SBATCH --get-user-env" | tee -a ${jobname}.cmd
echo "#SBATCH --clusters=${cluster}" | tee -a ${jobname}.cmd
echo "#SBATCH --partition=${partition}" | tee -a ${jobname}.cmd
echo "#SBATCH --cpus-per-task=${cpus}" | tee -a ${jobname}.cmd
echo "#SBATCH --nodes=1" | tee -a ${jobname}.cmd
echo "#SBATCH --mail-type=${mail_type}" | tee -a ${jobname}.cmd
echo "#SBATCH --mail-user=${mail_adr}" | tee -a ${jobname}.cmd
echo "#SBATCH --export=NONE" | tee -a ${jobname}.cmd
echo "#SBATCH --time=${walltime}" | tee -a ${jobname}.cmd
echo "#SBATCH --account=${account}" | tee -a ${jobname}.cmd
echo "module load slurm_setup" | tee -a ${jobname}.cmd

# execute processing script with several input variables
echo "bash ${script_folder}/unzip.sh $jobfolder/${jobname}.csv $input_folder $jobbundle $jobfolder $script_folder $cluster $timestamp $parallel_runs" | tee -a ${jobname}.cmd

# submit jobfile
sbatch ${jobname}.cmd

u=$(($u+1))
sleep 2s
done
fi

unzip.sh

#!/bin/bash

##########################################################################
##########################################################################
list=$1 # input list of Sentinel-2 L2A files to process in this job
input_folder=$2
jobbundle=$3 #jobbundle folder where single job-files are stored
jobfolder=$4
scriptfolder=$5
cluster=$6
timestamp=$7
parallel_runs=$8
##########################################################################

num_jobs="\j" # The prompt escape for number of jobs currently running

ln=`wc -l < ${list}` # get number of items in the list

# define task for parallel processing
task () {
# get path to Sentinel file from list
local m=$1
file=`sed -n "${m}p" ${list}`
name=$(basename $file .zip)
path=$(dirname $file)
cd ${input_folder}${path#.}

# unzip file and read error-code
unzip ${name}.zip
error_code=$(echo $?)

if [[ $error_code != "0" ]];
then
# if unzip failed
rm -r ${name}

echo "${file};Error-Code: ${error_code}" &gt;&gt; ${input_folder}/Sentinel_unzip_failed.csv
# write path of Sentinel zip file to failed list
else
# write path of Sentinel file to finished list and index
echo $file >> ${input_folder}/Sentinel_unzip_finished.csv

rm ${name}.zip

fi
}

# run tasks in parallel and keep the number of parallel processes within the allowed limit
for m in $(eval echo "{1..$ln}");
do
while (( ${num_jobs@P} >= parallel_runs ));
do
wait -n
done

task "$m" &
done

# wait for all parallel processes to finish before proceeding
wait

ln_err=`wc -l < ${jobfolder}/*.err` # read number of lines in .err file

# if no error occurred during processing, delete the jobfolder
if [ $ln_err -eq 0 ];
then
rm -r $jobfolder

#count jobs in queue
n_queue=$(squeue --clusters=${cluster} | grep ${timestamp} | wc -l)

#count folders in jobbundle folder
jobbundle=$(dirname $jobfolder)
n_folders=$(ls -d ${jobbundle}/* | wc -l)

# if all jobs ran without error and this is the last job in the queue, delete the entire jobbundle folder
if [ $n_folders -eq 0 -a $n_queue -eq 1 ];
then
rm -r $jobbundle
fi
fi