Introduction

We provide a job-oriented monitoring framework (ClusterCockpit) that can help you analyse performance issues in your jobs. The system is still in its testing phase, so some functions and visualizations may change over time.

The monitoring framework is currently installed only on Noctua 2. We will set it up on Noctua 1 within the next few weeks.

How to get access?

We provide this system to all of our users. Simply go to https://jobmon.pc2.uni-paderborn.de and log in with your usual user credentials.

The web page can be accessed either via VPN or on-site at Paderborn University.

Quick start

After logging in, you will find a list of available cluster systems. Currently, only Noctua 2 is supported.

You can see how many running jobs ClusterCockpit has recognized, as well as the total number of your jobs. Click on either “Running jobs” or “Total jobs” to get a brief overview of the monitored jobs. Freshly launched jobs can take up to 5 minutes to appear in ClusterCockpit.

By clicking on the appropriate buttons, you can sort the list by different criteria, show or hide metrics, or apply filters to the list. You may also switch the cluster from Noctua 2 to Noctua 1 and vice versa. Click on the job ID in the upper left corner of a row to display all details of that job.

In the upper left corner, you will find some metadata of the job. In the middle and on the right, you will see different diagrams that give a quick impression of the characteristics of the job. Below those diagrams, each recorded metric is plotted at different granularities. At the bottom of the page, you will find some additional metadata of the job.

A note on the metrics on those pages

The metric collectors do not know anything about Slurm jobs. They measure metrics at core, socket or node level. Therefore, meaningful metrics can generally only be measured if the job was running exclusively on the node.

This applies in particular to metrics such as cpu_load, mem_used, mem_bw, the InfiniBand metrics (ib*) and the Lustre metrics. These metrics cover the whole host, or at least a whole socket, and may therefore include more than the selected job.

So if you don’t want to live with these caveats, you should run your job exclusively on the node.
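As a minimal sketch, an exclusive job could be requested with the standard Slurm option --exclusive. The job name, node count, time limit and application command below are placeholders and not recommendations for this system; adapt them to your own job.

```shell
#!/bin/bash
#SBATCH --job-name=exclusive-example   # placeholder job name
#SBATCH --nodes=1                      # placeholder node count
#SBATCH --time=00:10:00                # placeholder time limit
#SBATCH --exclusive                    # standard Slurm option: do not share the allocated node(s) with other jobs

# With an exclusive allocation, node- and socket-level metrics
# (cpu_load, mem_used, mem_bw, ib*, lustre*) only reflect this job.
echo "Job ${SLURM_JOB_ID:-unknown} on nodes ${SLURM_JOB_NODELIST:-unknown}"

# srun ./my_app   # placeholder: replace with your own application
```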

Metrics

Metric name	Meaning	Meaningful for shared jobs
mem_used	main memory used on the node	no, for shared jobs have a look at the command-line tool `seff`
flops_any	floating-point operations performed by the CPU cores of the job, value=(FLOPS in FP64)+0.5*(FLOPS in FP32)	yes
mem_bw	memory bandwidth	no, because only the memory bandwidth of a full socket can be monitored
clock	frequency of the CPU cores of the job	yes, but other jobs on the node might affect the CPU frequency
pkg_pwr	Package power of the CPU sockets	no
ipc	Instructions per cycle	yes
ib_recv	Infiniband/Omnipath receive bandwidth	no
ib_xmit	Infiniband/Omnipath transmit bandwidth	no
ip_recv_pkts	Infiniband/Omnipath received packets	no
ip_xmit_pkts	Infiniband/Omnipath transmitted packets	no
cpu_load_core	load on the CPU cores of a job, i.e., the number of processes requesting CPU time on a specific CPU core	yes
cpu_user	percentage of CPU time spent as user time for each CPU core	yes
lustre_open	Lustre (PFS) file open requests	no
lustre_statfs	Lustre (PFS) file stat requests	no
lustre_close	Lustre (PFS) file close requests	no
cpu_load	load on the node, i.e., the number of processes requesting CPU time	no
lustre_read_bw	Lustre (PFS) read bandwidth	no
lustre_write_bw	Lustre (PFS) write bandwidth	no
nv_util	GPU compute utilization	yes
nv_mem_util	usage of the GPU memory in percent	yes
nv_fb_mem_used	usage of the GPU memory in GB	yes
nv_sm_clock	clock frequency of the GPU SMs	yes
nv_compute_processes	number of processes using the GPU	yes
nv_power_usage	power usage of the GPU	yes
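
The flops_any row above weights single-precision operations with a factor of 0.5. The following sketch only illustrates that arithmetic. The function name and the sample rates are made up for the example and are not part of ClusterCockpit.

```shell
#!/bin/bash
# flops_any as described in the table:
#   value = (FLOPS in FP64) + 0.5 * (FLOPS in FP32)
# The function name and the numbers below are illustrative assumptions.
flops_any() {
    local fp64="$1" fp32="$2"
    awk -v d="$fp64" -v s="$fp32" 'BEGIN { printf "%.1f\n", d + 0.5 * s }'
}

# Example: a rate of 100 in FP64 plus a rate of 50 in FP32 (same unit)
flops_any 100 50   # prints 125.0
```

A job that runs purely in single precision is therefore shown with half of its raw FP32 rate.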

Controlling the metric collector and job submission

There may be situations in which you don’t want the metric collector to run on a compute node, for example if you want to use your own profiler.

You can switch off metric collection by adding --collectors=off to your job script.

Code Block
--collectors=[mode] Sets the mode of the collectors for job-specific monitoring on the compute nodes. mode=[off,normal (default)]
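
A job script with metric collection switched off might look like the sketch below. It assumes that the option is given as an #SBATCH directive like other submission options. The job name, time limit and profiler command are placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=profiling-run   # placeholder job name
#SBATCH --time=00:30:00            # placeholder time limit
#SBATCH --collectors=off           # switch off the metric collectors for this job (default: normal)

# With the collectors off, your own profiler can be used instead.
echo "Job ${SLURM_JOB_ID:-unknown}: metric collectors requested off"

# srun my_profiler ./my_app   # placeholder: replace with your own profiler and application
```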

Please don’t use this option by default. The metrics will help us define the requirements for a new HPC system in future procurements. Indirectly, they also help you by enabling us to provide you with a suitable HPC system for your workload.

If you don’t want your jobs to be recorded in ClusterCockpit, you can disable the submission of their metadata. In that case, the metric collectors on the node are still running, but no metadata gets submitted to ClusterCockpit. Simply add --clustercockpit=off to your job script.

Code Block
--clustercockpit=[mode] Controls if job meta data gets submitted to the job-specific monitoring (ClusterCockpit, https://jobmon.pc2.uni-paderborn.de). mode=[off,normal (default)]
