Introduction
Clusters
The HPDA terrabyte compute infrastructure provides two clusters optimized for different use cases. The CPU cluster offers a large number of cores for precise, flexible and fast parallel processing and can be used for a wide range of tasks. The GPU cluster is optimized for ultra-fast processing of large amounts of data and is especially suited for machine learning tasks. Both clusters are hosted at the Leibniz Supercomputing Centre (LRZ) in Garching near Munich.
| | CPU cluster | GPU cluster |
|---|---|---|
| Number of nodes | 262 | 47 |
| Number of CPUs per node | 2 | 2 |
| Number of cores per CPU | 40 (80 Hyperthreads) | 24 (48 Hyperthreads) |
| Number of GPUs per node | 0 | 4 |
| CPU type | Intel Xeon Platinum 8380 40C 270W 2.3GHz | Intel Xeon Gold 6336Y 24C 185W 2.4GHz |
| GPU type | N/A | NVIDIA HGX A100 80GB 500W |
| RAM per node | 1024 GByte | (1024 + 320) GByte |
| Bandwidth to InfiniBand HDR per node | 200 GBit/s | 200 GBit/s |
| LINPACK computing power per node | 4.5 TFlop/s | 68.5 TFlop/s |
| Memory bandwidth per node | 409.6 GByte/s | (409.6 + 8156) GByte/s |
CPU Cluster
The HPDA terrabyte CPU cluster consists of several partitions. While some of the cluster's partitions are reserved for internal services and testing, two of them are currently available to the public:
All partitions belong to the hpda2 cluster system: Intel Xeon Platinum 8380 40C 270W 2.3GHz nodes with InfiniBand interconnect and 2 hardware threads per physical core.

| Cluster | Partition | Nodes in partition | CPU Cores and Hyperthreads per node | Typical job type | Node range per job (min-max) | Maximum runtime (hours) | Limit: CPU Cores and Hyperthreads | Limit: Memory (GByte) |
|---|---|---|---|---|---|---|---|---|
| hpda2 | hpda2_compute | 53 | 80 Cores / 160 Hyperthreads | | 1-53 | 240* | - | 1024 per node |
| hpda2 | hpda2_test | 2 | 80 Cores / 160 Hyperthreads | Do not run production jobs! | 1-1 | 2 | 80 Cores / 160 Hyperthreads | |
| hpda2 | hpda2_jupyter | 2 | 80 Cores / 160 Hyperthreads | | 1-1 | 48 | 4 Cores / 8 Hyperthreads | |
* If your job needs more than the maximum runtime of the partition, you can implement auto-requeuing of the job in your SLURM job script.
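A minimal sketch of such a self-requeuing job script (the program name, resume logic and signal lead time are placeholders, not part of the terrabyte documentation): the job asks SLURM to send a signal shortly before the time limit is reached and requeues itself from a trap.

```bash
#!/bin/bash
#SBATCH --job-name=long_running
#SBATCH --partition=hpda2_compute
#SBATCH --nodes=1
#SBATCH --time=240:00:00          # maximum runtime of the partition
#SBATCH --requeue                 # allow this job to be requeued
#SBATCH --open-mode=append        # append to the log file after a requeue
#SBATCH --signal=B:USR1@600       # send SIGUSR1 to the batch shell 10 min before the limit

# Requeue the job when the signal arrives, then exit cleanly.
requeue_job() {
    echo "Approaching time limit, requeueing job ${SLURM_JOB_ID}"
    scontrol requeue "${SLURM_JOB_ID}"
    exit 0
}
trap requeue_job USR1

# Hypothetical workload: it must be able to resume from the state
# it wrote in the previous run, otherwise requeuing only restarts it.
srun ./my_processing --resume &
wait
```

Note that the workload is started in the background and the script waits on it; otherwise the batch shell could not react to the signal while the program is running.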
GPU Cluster
All partitions belong to the hpda2 cluster system: NVIDIA HGX A100 80GB 500W GPU nodes with Intel Xeon Gold 6336Y 24C 185W 2.4GHz CPUs, InfiniBand interconnect and 2 hardware threads per physical core.

| Cluster | Partition | Nodes in partition | GPUs per node | CPU Cores and Hyperthreads per node | Typical job type | Node range per job (min-max) | Maximum runtime (hours) | Limit: GPUs | Limit: GPU memory (GByte) | Limit: CPU Cores and Hyperthreads | Limit: CPU memory (GByte) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| hpda2 | hpda2_compute_gpu | 14 | 4 | 48 Cores / 96 Hyperthreads | | 1-12 | 240* | - | 320 per node | - | 1024 per node |
| hpda2 | hpda2_testgpu | 1 | 4 | 48 Cores / 96 Hyperthreads | Do not run production jobs! | 1-1 | 2 | 4 | | 48 Cores / 96 Hyperthreads | |
* If your job needs more than the maximum runtime of the partition, you can implement auto-requeuing of the job in your SLURM job script (see the sketch after the CPU cluster table).
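As a hedged sketch, a GPU job on the hpda2_compute_gpu partition could be requested roughly as follows (the training script is a placeholder, and whether GPUs are requested via `--gres=gpu:N` or another site-specific option should be verified in the job submission section):

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=hpda2_compute_gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:2              # request 2 of the 4 A100 GPUs on the node (GRES name is an assumption)
#SBATCH --cpus-per-task=24
#SBATCH --time=24:00:00

# Hypothetical training script; replace with your own program.
srun python train_model.py
```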
Storage
Both the CPU and the GPU cluster are directly attached to a dedicated GPFS storage system (Data Science Storage, DSS) with a net capacity of about 50 PB. The DSS hosts a large collection of Earth observation and auxiliary data and offers the possibility to store personal data (HOME), project data (dedicated storage containers) and intermediate data (SCRATCH).
Access
On terrabyte HPC, processing jobs are created, run and managed from the command line. For this, we rely on the workload manager SLURM, a state-of-the-art scheduler used by HPC centres all around the world. Knowing how to use SLURM is a prerequisite for bringing your processing to large scale and making full use of the available hardware resources. But don't be scared if you have never heard of SLURM or HPC: everything you need is written down in this documentation, and it is quick and easy to learn. Learn about ways to run your processes on terrabyte in the job submission section. Jobs can either be started from the command line (interactive test jobs) or script-driven (for production jobs), as illustrated below.
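For orientation, a minimal sketch of both workflows (partition names are taken from the tables above; the job script name is a placeholder):

```bash
# Interactive test job: allocate one node on the test partition for 30 minutes
# and open a shell on it.
salloc --partition=hpda2_test --nodes=1 --time=00:30:00
srun --pty bash

# Production job: write a SLURM batch script and hand it to the scheduler.
sbatch my_job.sh

# Monitor and manage your jobs.
squeue -u $USER       # list your queued and running jobs
scancel <jobid>       # cancel a job
```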