Monitoring cluster resources

Monitor all resources

The availability of cluster resources is of interest to all users. While sinfo and scontrol are powerful built-in tools for monitoring SLURM and cluster resources, their output is not very convenient to work with. We provide a tool called check_cluster that wraps scontrol and visualizes the most important information directly on the command line:

(Screenshot: check_cluster output)
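For comparison, the same information can be queried with the built-in SLURM commands, although their output is considerably harder to read (here <nodename> is a placeholder for an actual node name):

sinfo -N -l
scontrol show node <nodename>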

The tool can be easily loaded from our terrabyte modules:

module use /dss/dsstbyfs01/pn56su/pn56su-dss-0020/usr/share/modules/files/
module load check_cluster

check_cluster

Monitor GPU usage

nvidia-smi is a well-known standard tool for monitoring GPU resources at the single-node level and is available on all of our GPU nodes.
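If you are logged in to one of the GPU nodes, e.g. within an interactive job, you can combine it with the standard watch utility to refresh the readings every few seconds (assuming watch is installed on the node, as it is on most Linux systems):

watch -n 2 nvidia-smi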

Nevertheless, we encourage you to try cluster-smi, which provides nvidia-smi-like output for multiple nodes at once. This is especially advantageous for jobs that run on multiple GPUs in parallel.

You can use it simply by running cluster-smi on the login node. You should then see the usage information for all available nodes:

cluster-smi

(Screenshot: cluster-smi output for all nodes)

To filter for jobs running under your user name, use the -u flag followed by your terrabyte ID:

cluster-smi -u <your ID>

To get more detailed information, as well as a timestamp of the measurement, use the -p and -t flags:

cluster-smi -p -t

(Screenshot: cluster-smi output with -p and -t)
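These flags can also be combined; for example, the following should give you detailed, time-stamped output restricted to your own jobs (this particular combination is not shown above, but the flags are independent of each other):

cluster-smi -u <your ID> -p -t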

You can use the -n flag to filter the nodes and show only those matching a given regular expression:

cluster-smi -n hpdar01c05s02

(Screenshot: cluster-smi output filtered by node)
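Because -n accepts a regular expression, you can also match whole groups of nodes at once. For example, a pattern like the following (hypothetical; adapt it to the actual node naming scheme) would match all nodes whose names start with hpdar01c05:

cluster-smi -n "hpdar01c05.*"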