
MPI

MPI provides an API for distributed-memory message passing in parallel programs. Supported programming languages are Fortran, C, and C++, although others, such as Java, Perl, R, and Python, can also be integrated.

  • A comprehensive introduction to MPI programming can be found on the LRZ webpages. The statements made there are transferable to the hpda terrabyte CPU cluster.
  • A simple MPI example can be found on the CARA and CARO wiki pages (DLR internal only).

MPI using a container

Here we show an example of how to set up a Charliecloud container for processing with MPI. The example is built upon the best-practice examples in the Charliecloud documentation but contains some modifications specific to the terrabyte infrastructure. When combining MPI and containers, keep in mind that MPI is very picky about compatibility between the software on the host and in the container. One major prerequisite is that a compatible version of OpenMPI is installed on the host. Which versions are compatible is a moving target, but using the same versions inside and outside the container usually works. In this example we use the versions of OpenMPI, UCX, and SLURM installed on the clusters at the time of writing; you may have to adapt it in case these software packages have been updated.
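
If you are unsure which versions are present, it can help to compare the MPI versions on the host and in the container before running anything. A minimal check, assuming the module path used in the examples below and that the mpihello image from this walkthrough has already been built and moved into your working directory:

# On the host (login or compute node)
module use /dss/dsstbyfs01/pn56su/pn56su-dss-0020/usr/share/modules/files/
module load openmpi
mpirun --version   # host OpenMPI version
ucx_info -v        # host UCX version (if the UCX tools are in your PATH)

# Inside the container (works once the mpihello image has been built and unpacked, see below)
ch-run ./mpihello -- mpirun --version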

  1. Log in to the login node
  2. Copy the examples folder from the Charliecloud git-repo and cd into the directory
svn export https://github.com/hpc/charliecloud.git/trunk/examples
cd examples
  3. First, build the three images using the Dockerfiles provided with the Charliecloud examples. These three builds should take about 15 minutes in total.

Note that Charliecloud infers the image names from the Dockerfile names, so we don't need to specify -t. Also, --force enables some workarounds for tools like distribution package managers that expect to really be root.

ch-image build --force -f Dockerfile.almalinux_8ch ~/examples
ch-image build --force -f Dockerfile.libfabric ~/examples
ch-image build --force -f Dockerfile.openmpi ~/examples
  4. Next, create a new directory for this project inside the examples folder and enter it.
mkdir mpi_test
cd mpi_test
  5. Create a new C program called mpihello.c that contains the following code:
#include <stdio.h>
#include <mpi.h>

int main (int argc, char **argv)
{
   int msg = 0, rank, rank_ct;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &rank_ct);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   printf("hello from rank %d of %d\n", rank, rank_ct);

   if (rank == 0) {
      for (int i = 1; i < rank_ct; i++) {
         MPI_Send(&msg, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
         printf("rank %d sent %d to rank %d\n", rank, msg, i);
      }
   } else {
      MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("rank %d received %d from rank 0\n", rank, msg);
   }

   MPI_Finalize();
}
  6. Create a new file called Dockerfile that contains the following code. The WORKDIR instruction changes the working directory; the default working directory within a Dockerfile is /. The /dss folder has to be created so that it can be written to later.
FROM openmpi
RUN mkdir /hello
RUN mkdir /dss
WORKDIR /hello
COPY mpihello.c .
RUN mpicc -o mpihello mpihello.c
  7. Now build and move the Charliecloud image to the working directory. The default Dockerfile is ./Dockerfile, so we can omit -f.
ch-image build -t mpihello ~/examples/mpi_test
mv /var/tmp/<YOUR USER ID>.ch/img/mpihello/ .
  8. Run an interactive job and run the example (a batch-script variant is sketched after this walkthrough).
# Get an interactive job on the test cluster
salloc --cluster=hpda2 --partition=hpda2_test --nodes=1 --time=01:00:00 --ntasks-per-node=4

# load the charliecloud and the openmpi module
module use /dss/dsstbyfs01/pn56su/pn56su-dss-0020/usr/share/modules/files/
module load charliecloud openmpi

# export a necessary environment variable for PMI
export PMIX_MCA_gds=hash

# Run the example in the container
mpiexec -np 4 ch-run -w mpihello -- /hello/mpihello
  9. If the output looks like this, the container is correctly configured for using MPI:

[Screenshot: MPI with container]
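
The same containerized run can also be submitted as a regular batch job instead of using salloc. Below is a minimal sketch, assuming the script is submitted from the mpi_test directory containing the unpacked mpihello image; the account and partition settings are taken from the Python examples below and have to be adjusted to your project:

#!/bin/bash
#SBATCH -J mpihello_container
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH -D ./
#SBATCH --cluster=hpda2
#SBATCH --partition=hpda2_compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --export=NONE
#SBATCH --time=00:05:00
#SBATCH --account=hpda-c

module load slurm_setup

# load the charliecloud and the openmpi module
module use /dss/dsstbyfs01/pn56su/pn56su-dss-0020/usr/share/modules/files/
module load charliecloud openmpi

# export a necessary environment variable for PMI
export PMIX_MCA_gds=hash

# Run the example in the container
mpiexec -n $SLURM_NTASKS ch-run -w ./mpihello -- /hello/mpihello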

MPI with Python

In this section, we present examples of how to use MPI with Python in different software environments: the preinstalled stack modules (low flexibility), a conda environment (high flexibility), and a software container (very high flexibility). As with all other examples, these are just appetizers that you can modify as you like. There are endless possibilities to implement MPI functionality for your own workflows and needs.

Example with preinstalled stack modules

  1. Log in to the login node
  2. Create a Python script that uses MPI (e.g. named mpi_hello_world.py). A simple Python script may look like this:
from mpi4py import MPI
comm = MPI.COMM_WORLD
print("Hello, my rank is: " + str(comm.rank))
  3. Create a SLURM script (e.g. named mpi_hello_world_py.cmd). The SLURM script may look like this:
#!/bin/bash
#SBATCH -J mpi_hello_world_py
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH -D ./
#SBATCH --get-user-env
#SBATCH --cluster=hpda2
#SBATCH --partition=hpda2_compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mail-type=all
#SBATCH --mail-user=insert_your_email_here
#SBATCH --export=NONE
#SBATCH --time=00:05:00
#SBATCH --account=hpda-c

#switch to the latest version of spack modules
module switch spack/23.1.0
module load slurm_setup
#load the intel mpi-module
module load intel-mpi
#load the python module with mpi4py installed
module load python/3.10.10-extended

mpiexec -n $SLURM_NTASKS python path/to/mpi_hello_world.py
  4. Submit the SLURM script to the scheduler
sbatch path/to/mpi_hello_world_py.cmd
  5. The output logged in the .out file should look like this (the order of the lines may vary):
Hello, my rank is: 0
Hello, my rank is: 1
Hello, my rank is: 2
Hello, my rank is: 3

Example with conda environment

  1. Log in to the login node
  2. Create a Python script that uses MPI (e.g. named mpi_hello_world.py). A simple Python script may look like this:
from mpi4py import MPI
comm = MPI.COMM_WORLD
print("Hello, my rank is: " + str(comm.rank))
  3. Create a conda environment.yaml with the packages you want to install. The environment.yaml may look like this, but can of course be adapted to your individual needs:
name: mpi_test
channels:
- conda-forge
- defaults
dependencies:
- python<=3.9
- numba
- numpy
- rioxarray
- pystac-client
- zarr
- pystac
- xarray
- shapely
- geopandas
- jupyterlab
- gdal
Installation

It is important to note that the mpi4py package must not be installed via the environment.yaml file, but with pip in a later step. Otherwise mpi4py cannot be built against the current version of OpenMPI on the system, which is a prerequisite for MPI to work (a quick check for this is shown at the end of this example).

  4. Create the conda environment, install the packages defined in the environment.yaml, and then install the mpi4py package separately.
module load miniconda3
module load openmpi

conda env create -f path/to/environment.yaml

#we have to install mpi4py with pip so that it is built against the OpenMPI version available on the system
conda activate mpi_test
pip install mpi4py
conda deactivate
  5. Create a SLURM script (e.g. named mpi_hello_world_py_conda.cmd). The SLURM script may look like this:
Activation

Please note that you should use "source activate" instead of "conda activate" in the SLURM job to activate your environment. Otherwise you will run into an error, since the conda shell functions are not exported by default and are therefore not available in the job's shell.
This is because your .bashrc, which contains the necessary conda initialization, is not sourced in a SLURM job (for good reason: to prevent settings meant for your login shell from clashing with your SLURM job).

#!/bin/bash
#SBATCH -J mpi_hello_world_py_conda
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH -D ./
#SBATCH --get-user-env
#SBATCH --cluster=hpda2
#SBATCH --partition=hpda2_compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mail-type=all
#SBATCH --mail-user=insert_your_email_here
#SBATCH --export=NONE
#SBATCH --time=00:05:00
#SBATCH --account=hpda-c

module load slurm_setup miniconda3 openmpi
source activate mpi_test
mpiexec -n $SLURM_NTASKS python path/to/mpi_hello_world.py
  6. Submit the SLURM script to the scheduler
sbatch path/to/mpi_hello_world_py_conda.cmd
  7. The output logged in the .out file should look like this (the order of the lines may vary):
Hello, my rank is: 0
Hello, my rank is: 1
Hello, my rank is: 2
Hello, my rank is: 3
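
To verify that mpi4py was indeed built against the OpenMPI installation on the system (see the installation note above), you can print the MPI library version from within the environment. This is just a quick sanity check; the reported version string should mention Open MPI:

module load miniconda3 openmpi
source activate mpi_test
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"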