Clusters

The HPDA terrabyte compute infrastructure provides two clusters optimized for different use cases. The CPU cluster offers a large number of cores for precise, flexible and fast parallel processing and can be used for a wide range of tasks. The GPU cluster is optimized for ultra-fast processing of large amounts of data and is especially suited for machine learning tasks. Both clusters are hosted at the LRZ (Leibniz Supercomputing Centre) in Garching near Munich.

|                                      | CPU cluster                               | GPU cluster                            |
| ------------------------------------ | ----------------------------------------- | -------------------------------------- |
| Number of nodes                      | 262                                       | 47                                     |
| Number of CPUs per node              | 2                                         | 2                                      |
| Number of cores per CPU              | 40 (80 Hyperthreads)                      | 24 (48 Hyperthreads)                   |
| Number of GPUs per node              | 0                                         | 4                                      |
| CPU type                             | Intel Xeon Platinum 8380 40C 270W 2.3 GHz | Intel Xeon Gold 6336Y 24C 185W 2.4 GHz |
| GPU type                             | N/A                                       | NVIDIA HGX A100 80GB 500W              |
| RAM per node                         | 1024 GByte                                | (1024 + 320) GByte                     |
| Bandwidth to Infiniband HDR per node | 200 GBit/s                                | 200 GBit/s                             |
| LINPACK computing power per node     | 4.5 TFlop/s                               | 68.5 TFlop/s                           |
| Memory bandwidth per node            | 409.6 GByte/s                             | (409.6 + 8156) GByte/s                 |

CPU-Cluster

The HPDA terrabyte CPU cluster consists of several partitions. While some of the cluster's partitions are reserved for internal services and testing, the partitions listed below are currently available to the public:

Cluster system: Intel Xeon Platinum 8380 40C 270W 2.3 GHz nodes with Infiniband interconnect and 2 hardware threads per physical core. The first columns give the cluster specifications, the remaining columns the per-job limits.

| Cluster | Partition     | Nodes in partition | CPU Cores and Hyperthreads per node | Typical job type            | Node range per job (min-max) | Maximum runtime (hours) | CPU Cores and Hyperthreads (limit) | Memory limit (GByte) |
| ------- | ------------- | ------------------ | ----------------------------------- | --------------------------- | ---------------------------- | ----------------------- | ---------------------------------- | -------------------- |
| hpda2   | hpda2_compute | 53                 | 80 Cores / 160 Hyperthreads         |                             | 1-53                         | 240*                    | -                                  | 1024 per node        |
| hpda2   | hpda2_test    | 2                  | 80 Cores / 160 Hyperthreads         | Do not run production jobs! | 1-1                          | 2                       | 80 Cores / 160 Hyperthreads        | 1024 per node        |
| hpda2   | hpda2_jupyter | 2                  | 80 Cores / 160 Hyperthreads         |                             | 1-1                          | 48                      | 4 Cores / 8 Hyperthreads           | 1024 per node        |

* If your job needs more time than the partition's maximum runtime, you can implement auto-requeuing of the job in your SLURM job script.
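
For illustration, here is a minimal sketch of such a self-requeuing batch script for the hpda2_compute partition. It assumes your application writes checkpoints and can resume from them; the application name and its resume option are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=long_run
#SBATCH --partition=hpda2_compute
#SBATCH --nodes=1
#SBATCH --time=240:00:00        # partition maximum (240 hours)
#SBATCH --requeue               # allow SLURM to requeue this job
#SBATCH --open-mode=append      # keep appending to the same output file after a requeue
#SBATCH --signal=B:USR1@600     # send SIGUSR1 to the batch shell 10 minutes before the time limit

# When the warning signal arrives, requeue the job and exit cleanly.
trap 'echo "Approaching time limit, requeueing"; scontrol requeue "$SLURM_JOB_ID"; exit 0' USR1

# Run the (checkpointing) application in the background so the trap can fire,
# then wait for it to finish or for the signal to arrive.
srun ./my_processing --resume-from-checkpoint &   # placeholder application and option
wait
```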

GPU-Cluster

Cluster system: NVIDIA HGX A100 80GB 500W GPU nodes with Intel Xeon Gold 6336Y 24C 185W 2.4 GHz CPUs (with Infiniband interconnect and 2 hardware threads per physical core). The first columns give the cluster specifications, the remaining columns the per-job limits.

| Cluster | Partition         | Nodes in partition | GPUs per node | CPU Cores and Hyperthreads per node | Typical job type            | Node range per job (min-max) | Maximum runtime (hours) | GPUs (limit) | Memory limit GPU (GByte) | CPU Cores and Hyperthreads (limit) | Memory limit CPU (GByte) |
| ------- | ----------------- | ------------------ | ------------- | ----------------------------------- | --------------------------- | ---------------------------- | ----------------------- | ------------ | ------------------------ | ---------------------------------- | ------------------------ |
| hpda2   | hpda2_compute_gpu | 14                 | 4             | 48 Cores / 96 Hyperthreads          |                             | 1-12                         | 240*                    | -            | 320 per node             | -                                  | 1024 per node            |
| hpda2   | hpda2_testgpu     | 1                  | 4             | 48 Cores / 96 Hyperthreads          | Do not run production jobs! | 1-1                          | 2                       | 4            | 320 per node             | 48 Cores / 96 Hyperthreads         | 1024 per node            |

* If your job needs more time than the partition's maximum runtime, you can implement auto-requeuing of the job in your SLURM job script.
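
As a complement, here is a minimal sketch of a batch script for the hpda2_compute_gpu partition. It assumes GPUs are requested through SLURM's generic resource mechanism under the name "gpu" and uses placeholder module and script names; check the job submission documentation for the exact options on terrabyte.

```bash
#!/bin/bash
#SBATCH --job-name=gpu_train
#SBATCH --partition=hpda2_compute_gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:2            # request 2 of the 4 A100 GPUs on the node (assumed gres name "gpu")
#SBATCH --cpus-per-task=24      # a share of the 48 cores / 96 hyperthreads
#SBATCH --time=02:00:00

module load python              # placeholder module name
srun python train_model.py      # placeholder training script
```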

Storage

Both the CPU and the GPU cluster are directly attached to a dedicated GPFS storage system (Data Science Storage, DSS) with a net capacity of about 50 PB. The DSS hosts a large collection of Earth observation and auxiliary data and offers the possibility to store personal data (HOME), project data (dedicated storage containers) and intermediate data (SCRATCH).
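
Within a job, a typical pattern is to stage input from a project container to SCRATCH, process it there, and copy the results back. The sketch below only illustrates this workflow; all paths are placeholders, and the actual mount points of your HOME, project container and SCRATCH areas are described in the storage documentation.

```bash
#!/bin/bash
# All paths are placeholders - replace them with the actual locations of your
# DSS project container and your SCRATCH area (see the storage documentation).
PROJECT_DATA=/dss/<your_project_container>/input
RESULTS_DIR=/dss/<your_project_container>/results
SCRATCH_DIR=/scratch/<your_user>/run_$SLURM_JOB_ID        # intermediate data

mkdir -p "$SCRATCH_DIR" "$RESULTS_DIR"
cp -r "$PROJECT_DATA" "$SCRATCH_DIR/"                     # stage input to intermediate storage
./process_data "$SCRATCH_DIR/input" "$SCRATCH_DIR/out"    # placeholder processing step
cp -r "$SCRATCH_DIR/out" "$RESULTS_DIR/"                  # copy results back to the project container
```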

Access

On terrabyte HPC, processing jobs are created, run and managed from the command line. For this, we rely on the workload manager SLURM, a state-of-the-art scheduling system used at HPC centers around the world. Knowing how to use SLURM is a prerequisite for scaling up your processing and making full use of the available hardware resources. Don't worry if you have never heard of SLURM or HPC before: everything you need is written down in this documentation, and it is quick and easy to learn. Learn about the ways to run your processing on terrabyte in the job submission section. Jobs can be started either interactively from the command line (for test jobs) or via batch scripts (for production jobs).
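
For orientation, the two modes look roughly like this; the partition name comes from the tables above, while the batch script name is a placeholder.

```bash
# Interactive test job: request one node on the test partition and open a shell on it
srun --partition=hpda2_test --nodes=1 --time=00:30:00 --pty bash -i

# Production job: submit a batch script and check its status
sbatch my_job.slurm          # my_job.slurm is a SLURM batch script like the sketches above
squeue -u $USER              # list your queued and running jobs
```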