Parallel distributed-memory jobs (MPI)

In distributed-memory (DM) parallelism, an application achieves parallelism by running multiple instances of itself, typically across multiple nodes. Each instance is allocated its own chunk of virtual memory and communicates with the other instances via a message-passing interface such as MPI. DM jobs have many advantages, among them more cores to tackle the problem and a far greater amount of memory, in contrast to serial or shared-memory jobs that are limited to the resources of one node. However, using MPI can get quite tricky. It has its origins in mathematical and scientific computing/modelling and traditionally requires portable message-passing programs written in C, C++ or Fortran, although using MPI from Python is also possible today. MPI is therefore probably not what the average remote-sensing scientist or geographer works with, or what fits their workflows. So if you feel more comfortable with common EO processing using scripting languages, don't worry: serial shared-memory jobs or xarray/Dask are powerful tools to boost your processing performance! If you are a pro, or if you are already familiar with MPI, you will be more than happy to run distributed-memory parallel jobs with MPI, or MPI-OpenMP hybrid jobs, on the HPC.

General Instructions

This is a simple step-by-step recipe for the simplest type of parallel distributed-memory job, illustrating the use of the SLURM commands for users of a bash shell. For a general introduction to SLURM and links to the very detailed documentation, see SLURM Workload Manager. Concrete examples of parallel distributed-memory job scripts can be found in the examples section below. More information on MPI and OpenMP programming can be found here: MPI, OpenMP.

Resource allocation

Specify the amount of required resources wisely! Don't block resources by requesting more (e.g. number of cores, RAM, processing time) than you need. Since SLURM takes care of a fair share of resources, small jobs (i.e. jobs that request a small amount of resources for a short time) are preferred over large jobs (i.e. jobs that request a large amount of resources for a long time) when scheduling the queue.
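If the SLURM accounting database is enabled on the cluster (an assumption here, not stated in this documentation), the sacct command is one way to check what a finished job actually consumed and thus to right-size future requests; the job ID is a placeholder:

sacct --clusters=hpda2 --jobs=<job ID> --format=JobID,JobName,Elapsed,MaxRSS,AllocCPUS,State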

Submission of a job

Step 1: Login to the login node

Step 2: Edit a job script

The following script is assumed to be stored in the file myjob.cmd. For a more complete list of possible SLURM script variables, see SLURM Workload Manager.

#!/bin/bash
#SBATCH -J <job_name>
   (Placeholder) Name of the job (not more than 10 characters, please).
#SBATCH -o ./%x.%j.%N.out
   (Placeholder) Standard output goes there. Note that the directory where the output file is placed must exist before the job starts, and the full path name must be specified (no environment variable!). You can choose any name you like. If you prefer automatic naming, %x encodes the job name into the output file, %j encodes the job ID, and %N encodes the master node of the job and can be added if job IDs from different SLURM clusters might be the same. Here, the specified path is relative to the directory specified in the -D spec.
#SBATCH -e ./%x.%j.%N.err
   (Placeholder) Standard error goes there. The same remarks as for the -o specification apply.
#SBATCH -D ./
   Directory used by the script as starting point (working directory). The directory specified must exist. Here, the path is relative to the submission directory.
#SBATCH --get-user-env
   Set the user environment properly.
#SBATCH --clusters=hpda2
#SBATCH --partition=<cluster partition>
   Define the cluster and partition to run the job on. See Available CPU-cluster partitions.
#SBATCH --cpus-per-task=<number of hyperthreads>
   Number of hyperthreads per MPI task. The minimum is --cpus-per-task=1 (i.e. 1 hyperthread), the maximum is --cpus-per-task=160 (i.e. 160 hyperthreads ≡ 80 physical cores ≡ 1 full node).
#SBATCH --gres=gpu:<number of GPUs>
   Number of GPUs per job. Only available on the hpda2_compute_gpu partition.
#SBATCH --mem=<e.g. 500mb, 50gb, ...>
   For parallel distributed-memory jobs, the total memory available to the job is the number of nodes multiplied by the --mem value (e.g. --nodes=4 together with --mem=50gb yields 200 GB in total). If not specified, the memory of a full node is allocated.
#SBATCH --nodes=<integer number>
   Number of (shared-memory multi-core) nodes assigned to the job. The number of nodes may be increased up to the queue limit.
#SBATCH --ntasks-per-node=80
   Number of MPI tasks to start on each node. Typically, the value used here should not be larger than the number of physical cores in a node (i.e. 80). It may be chosen smaller for various reasons (memory needed per task, hybrid programs, etc.).
#SBATCH --mail-type=end
   Send an e-mail at job completion.
#SBATCH --mail-user=<email_address>@<domain>
   (Placeholder) E-mail address (don't forget, and please enter a valid address!).
#SBATCH --export=NONE
   Do not export the environment of the submitting shell into the job. While SLURM also allows ALL here, this is strongly discouraged, because the submission environment is very likely to be inconsistent with the environment required for execution of the job.
#SBATCH --time=08:00:00
   Maximum run time, 8 hours 0 minutes 0 seconds in this example. The limit may be increased up to the queue limit.
#SBATCH --account=<your account>
   If you belong to a terrabyte project, you must specify your group account (which is identical to the project id) in order to benefit from a higher priority in the SLURM queue, e.g. --account=pn56su. DLR users who got their account through the self-registration portal are in the group hpda-c by default (lowest priority class), but may specify a project id with a higher priority class here as well.
#SBATCH --exclusive
   Optional: if set, exclusive mode is activated, i.e. the job runs on its node(s) without having to share them with other jobs (e.g. for performance monitoring).
module load slurm_setup
   First executed line: SLURM settings necessary for proper setup of the batch environment.
module load ...
   Load any required environment modules (usually needed if the program is linked against shared libraries, or if paths to applications are needed). The "..." is of course a placeholder.
mpiexec -n $SLURM_NTASKS ./my_mpi_prog.exe
   Start the MPI executable. The MPI variant used depends on the loaded module set; non-MPI programs may fail to start up - please consult the example jobs or software-specific documentation for other startup mechanisms. The total number of MPI tasks is supplied by SLURM via the referenced variable; with the settings above it equals the number of nodes multiplied by 80.

This script essentially looks like a bash script. However, there are specially marked comment lines ("control sequences"), which have a special meaning in the SLURM context. The entries marked "Placeholder" must be suitably modified to have valid user-specific values.

For this script, the environment of the submitting shell will not be exported to the job's environment. The latter is completely set up via the module system inside the script.

Step 3: Submission procedure

The job script is submitted to the queue via the command

sbatch path/to/myjob.cmd

At submission time the control sequences are evaluated and stored in the queuing database, and the script is copied into an internal directory for later execution. If the command was executed successfully, the Job ID will be returned as follows:

Submitted batch job 65648.

It is a good idea to note down your Job IDs, for example to provide them to the LRZ Support as information if anything goes wrong. The submission command can also contain control sequences. For example,

sbatch --time=12:00:00 myjob.cmd

would override the setting inside the script, forcing it to run 12 instead of 8 hours.

Step 4: Checking the status of a job

Once submitted, the job will be queued for some time, depending on how many jobs are presently in the queue. Eventually (roughly once previously submitted jobs have completed), the job will be started on one or more of the systems determined by its resource requirements. The status of the current jobs of a user can be queried with the squeue --clusters=hpda2 --user=<your user ID> command. The output may look similar to this:

CLUSTER: hpda2

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            100120 hpda2_com  MPIjob1  di82rij  R 1-03:19:49      4 ....

The state "PD" indicates that the job is queued (pending execution). Once the job is running, the output indicates the state "R" (running) and also lists the host(s) it is running on. For jobs that have not yet started, the --start option, applied to squeue, provides an estimate (!) of the starting time. The sinfo --clusters=[all | cluster_name] command prints an overview of the status of all clusters or of a particular cluster in the SLURM configuration.
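For example, to obtain the start-time estimate for your own pending jobs and an overview of the hpda2 cluster (the user ID is a placeholder, as above):

squeue --clusters=hpda2 --user=<your user ID> --start
sinfo --clusters=hpda2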

Inspection and modification of jobs

Queued jobs can be inspected for their characteristics via the command

scontrol --clusters=<cluster_name> show jobid=<job ID>

which will print out a list of "Keyword=Value" pairs which characterize the job. As long as a job is waiting in the queue, it is possible to modify at least some of these; for example, the command

scontrol --clusters=<cluster_name> update jobid=65648 TimeLimit=04:00:00

would change the run time limit of the above-mentioned example job from 8 hours to 4 hours.

Deleting jobs from the queue

To forcibly remove a job from SLURM, the command

scancel --clusters=<cluster_name> <JOB_ID>

can be used. Please do not forget to specify the cluster! The scancel(1) man page provides further information on the use of this command.

Example parallel distributed memory jobs

MPI jobs

For MPI documentation please consult the MPI page. You can either load Intel MPI or openmpi (see examples below). If you need a certain version of MPI, see Software Modules for how to find and load available options.

MPI jobs may be jobs that use MPI only for parallelization ("MPP-style"), or jobs that combine usage of MPI and OpenMP ("hybrid").

A setup as for the hybrid job can also serve to provide more memory per MPI task without using OpenMP (e.g., by setting OMP_NUM_THREADS=1). Note that this will leave cores unused!
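A minimal sketch of this idea (placeholders, partition and module names follow the examples below; adjust nodes, run time and account to your needs): requesting fewer MPI tasks per node than physical cores, combined with a matching --cpus-per-task value, spreads a node's memory over fewer tasks.

#!/bin/bash
#SBATCH --clusters=hpda2
#SBATCH --partition=hpda2_compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=8
#SBATCH --export=NONE
#SBATCH --time=01:00:00
#SBATCH --account=hpda-c

module load slurm_setup
module load mpi.intel

# Pure MPI: each of the 40 tasks owns 4 physical cores (8 hyperthreads) and thus
# 4x the per-task share of node memory compared to a fully populated node;
# 3 of the 4 cores per task stay idle.
export OMP_NUM_THREADS=1
mpiexec -n $SLURM_NTASKS ./my_mpi_program.exe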

Simple MPI job 1

#!/bin/bash

#SBATCH -J only_mpi1

#SBATCH -o ./%x.%j.%N.out

#SBATCH -e ./%x.%j.%N.err

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --cluster=hpda2

#SBATCH --partition=hpda2_compute

#SBATCH --nodes=8

#SBATCH --ntasks-per-node=20

#SBATCH --mail-type=all

#SBATCH --mail-user=insert_your_email_here

#SBATCH --export=NONE

#SBATCH --time=08:00:00

#SBATCH --account=hpda-c

module load slurm_setup

module load mpi.intel

mpiexec -n $SLURM_NTASKS ./my_mpi_program.exe

The example will start 160 MPI tasks distributed over 8 nodes. Consequently, 4 physical cores (8 hyperthreads) will be assigned to each task. The user gets notified by mail in all cases, i.e. when the job begins, ends or fails.

Simple MPI Job 2

#!/bin/bash

#SBATCH -J only_mpi2

#SBATCH -o /example/path/to/stdout.logfile

#SBATCH -e /example/path/to/stderr.logfile

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --cluster hpda2

#SBATCH --partition hpda2_compute

#SBATCH --ntasks=160

#SBATCH --cpus-per-task=2

#SBATCH --mail-type=end

#SBATCH --mail-user=insert_your_email_here

#SBATCH --export=NONE

#SBATCH --time 00:10:00

#SBATCH --account=hpda-c

module load slurm_setup

module load mpi.intel

mpiexec -n $SLURM_NTASKS ./my_mpi_program.exe

Each of the 160 MPI tasks will be assigned 1 core (2 hyperthreads). Since "cpu" here actually means hyperthread, 2 nodes are requested (i.e. 320 hyperthreads = 160 physical cores = 2 x 80 cores of a node). The user gets notified by mail when the job ends.

MPI Job using a container

#!/bin/bash

#SBATCH -J charliecloud_mpi_test

#SBATCH -o output_charliecloud_mpi_test.txt

#SBATCH -e error_charliecloud_mpi_test.txt

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --cluster hpda2

#SBATCH --partition hpda2_compute

#SBATCH --nodes=4

#SBATCH --ntasks-per-node=40

#SBATCH --cpus-per-task=2

#SBATCH --mail-type=fail

#SBATCH --mail-user=insert_your_email_here

#SBATCH --export=NONE

#SBATCH --time=00:30:00

#SBATCH --account=hpda-c

module load slurm_setup

module load charliecloud

module load openmpi

export PMIX_MCA_gds=hash

mpiexec -n $SLURM_NTASKS ch-run -b /dss/.:/dss/ -w /path/to/container/image -- /MPI_TEST/HelloWorld/mpi_hello_world

The above example will start 160 MPI tasks distributed over 4 nodes, each of them with 2 hyperthreads (1 physical core) assigned. The MPI executable is executed in a Charliecloud container that has the DSS mounted under /dss. In this example openmpi is used instead of Intel MPI. We provide an example of how to set up a Charliecloud container for the usage of MPI here. The user only gets notified by mail if the job fails.

MPI-OpenMP hybrid job

#!/bin/bash

#SBATCH -J hybrid_mpi_omp

#SBATCH -o ./%x_%j.out

#SBATCH -e ./%x.%j.err

#SBATCH -D ./

#SBATCH --cluster hpda2

#SBATCH --partition hpda2_compute

#SBATCH --ntasks=40

#SBATCH --cpus-per-task=16

#SBATCH --export=NONE

#SBATCH --get-user-env

#SBATCH --time 00:10:00

#SBATCH --account=hpda-c

#SBATCH --exclusive

module load slurm_setup

module load mpi.intel

export OMP_NUM_THREADS=8

mpiexec -n $SLURM_NTASKS ./my_mpi_program.exe

Each of the 40 MPI tasks will be assigned 8 physical cores (16 hyperthreads), i.e. each task is a multi-threaded OpenMP process that uses 8 physical cores. Since "cpu" here actually means hyperthread, 4 nodes are requested (i.e. 640 hyperthreads = 320 physical cores = 4 x 80 cores of a node). Due to the --exclusive flag, the nodes are assigned exclusively to the submitted job and are not shared with others.

Note

Before going into production with a code that supports hybrid mode, either via OpenMP or via automatic parallelization, please check whether performance is not better when running with one thread per MPI process. Please note: removing -openmp altogether may improve the performance of hybrid MPI+OpenMP codes (which then run as pure MPI codes). If you run with OMP_NUM_THREADS set to 1 because you want the "pure" MPI case, your code may perform better if you compile/link it without the -openmp flag. If you compile/link with the flag, your code may be penalized with OpenMP overhead even though you don't use OpenMP, since the compiler may produce less optimized code due to the OpenMP-induced code transformations.

For codes that have explicit calls to OpenMP functions, either shield the calls with !$ directives, or compile them for the "pure" MPI case using the -openmp_stubs option instead of -openmp. A code compiled with -openmp_stubs will not work if OMP_NUM_THREADS is set to a value greater than 1.
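As a sketch of the two build variants described in this note (assuming the Intel MPI compiler wrapper mpiifort; file names are placeholders, and flag spellings can differ between compiler versions, so check the documentation of the loaded compiler module):

mpiifort -openmp -O2 -o my_hybrid_prog.exe my_prog.f90
mpiifort -openmp_stubs -O2 -o my_mpi_prog.exe my_prog.f90

The first line produces a hybrid MPI+OpenMP executable; the second resolves the OpenMP calls via the stub library and yields a "pure" MPI executable.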

Note that there may well be cases in which retaining hybrid functionality gives a performance advantage, e.g. if your code becomes cache-bound and little shared-memory synchronization is required. But you need to check this, and optimize the number of threads used, if you decide in favour of hybrid mode.

MPI jobs on the GPU-cluster

Here we show an example of an MPI job on the GPU cluster. The general remarks from the CPU examples above also apply.

#!/bin/bash

#SBATCH -J example_benchmark

#SBATCH -o ./%x.%j.%N.out

#SBATCH -e ./%x.%j.%N.err

#SBATCH -D ./

#SBATCH --get-user-env

#SBATCH --cluster=hpda2

#SBATCH --partition=hpda2_compute_gpu

#SBATCH --nodes=2

#SBATCH --ntasks-per-node=4

#SBATCH --gres=gpu:4

#SBATCH --export=NONE

#SBATCH --time=30:00:00

#SBATCH --account=hpda-c

module load openmpi

module load cuda

export CUDA_PATH=$CUDA_BASE

mpiexec -n $SLURM_NTASKS ./Benchmark -i ini/benchmark.ini

The example will start 8 MPI tasks distributed over 2 nodes; consequently, 1 GPU will be assigned to each task. Note that one node has 4 GPUs, hence the maximum you can specify is --gres=gpu:4!