Job Monitoring


Introduction

We provide a job-oriented monitoring framework (ClusterCockpit) that can help you analyse performance issues with your jobs. The system is still in its testing phase, which means that some functions and visualizations may change over time.

How to get access?

We provide this system to all of our users. Simply go to https://jobmon.pc2.uni-paderborn.de and log in with your usual user credentials.

The web page can be accessed either via VPN or on-site at Paderborn University.

Quick start

After login, you will find a list of available cluster systems.

You can see how many running jobs ClusterCockpit currently recognizes as well as your total number of jobs. Click either on “Running jobs” or “Total jobs” to get a brief overview of the monitored jobs. Freshly launched jobs can take up to 5 minutes to appear in ClusterCockpit.

By clicking the appropriate buttons, you can sort the list by different criteria, show or hide metrics, or apply filters to the list. You can also switch the cluster from Noctua 2 to Noctua 1 and vice versa. Click on the job ID in the upper left corner of a row to display all details of that job.

In the upper left corner you will find some metadata of the job. In the middle and on the right, several diagrams give a quick impression of the job’s characteristics. Below those diagrams, each recorded metric is plotted at different granularities. At the bottom of the page you will find additional metadata of the job.

A word about the metrics on these pages

The metric collectors do not know anything about Slurm jobs. They measure metrics at the core, socket, or node level. Meaningful per-job metrics can therefore generally only be obtained if the job runs exclusively on the node.

This applies in particular to metrics such as cpu_load, mem_used, mem_bw, the InfiniBand metrics (ib_*), and the Lustre metrics. These metrics are measured for the whole host (or at least a whole socket) and may therefore cover more than the selected job.

If you do not want to live with these caveats, you should run your job exclusively, as sketched below.
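A minimal sketch of a Slurm job script that requests a node exclusively via the standard --exclusive option; job name, runtime, and application are placeholders:

#!/bin/bash
#SBATCH --job-name=my-job      # placeholder job name
#SBATCH --nodes=1
#SBATCH --time=01:00:00        # placeholder runtime
#SBATCH --exclusive            # no other jobs on the node, so node-level metrics cover only this job

srun ./my_application          # placeholder for your actual program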

Metrics

| Metric name | Meaning | Meaningful for shared jobs |
| --- | --- | --- |
| mem_used | main memory used on the node | no; for shared jobs, use the command-line tool seff (see the example below the table) |
| flops_any | floating-point operations performed by the CPU cores of the job; value = (FLOPS in FP64) + 0.5 * (FLOPS in FP32) | yes |
| mem_bw | memory bandwidth | no, because memory bandwidth can only be monitored for a full socket |
| clock | frequency of the CPU cores of the job | yes, but other jobs on the node might affect the CPU frequency |
| pkg_pwr | package power of the CPU sockets | no |
| ipc | instructions per cycle | yes |
| ib_recv | InfiniBand/Omni-Path receive bandwidth | no |
| ib_xmit | InfiniBand/Omni-Path transmit bandwidth | no |
| ib_recv_pkts | InfiniBand/Omni-Path received packets | no |
| ib_xmit_pkts | InfiniBand/Omni-Path transmitted packets | no |
| cpu_load_core | load on the CPU cores of a job, i.e., the number of processes requesting CPU time on a specific CPU core | yes |
| cpu_user | percentage of CPU time spent as user time for each CPU core | yes |
| lustre_open | Lustre (PFS) file open requests | no |
| lustre_statfs | Lustre (PFS) file stat requests | no |
| lustre_close | Lustre (PFS) file close requests | no |
| cpu_load | load on the node, i.e., the number of processes requesting CPU time | no |
| lustre_read_bw | Lustre (PFS) read bandwidth | no |
| lustre_write_bw | Lustre (PFS) write bandwidth | no |
| nv_util | GPU compute utilization | yes |
| nv_mem_util | GPU memory utilization in percent | yes |
| nv_fb_mem_used | GPU memory used in GB | yes |
| nv_sm_clock | clock frequency of the GPU SMs | yes |
| nv_compute_processes | number of processes using the GPU | yes |
| nv_power_usage | power usage of the GPU | yes |
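As noted for mem_used, the Slurm command-line tool seff reports accounting-based resource usage (including CPU and memory efficiency) for a finished job, which also works for shared jobs. A minimal usage sketch, where 1234567 is a placeholder job ID:

seff 1234567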

Controlling the metric collector and job submission

There might be situations in which you do not want the metric collector to run on a compute node, for example when you want to use your own profiler.

You can switch off metric collection by adding --collectors=off to your job script, as shown in the sketch below.

--collectors=[mode] Sets the mode of the collectors for job-specific monitoring on the compute nodes. mode=[off,normal (default)]
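A minimal sketch of a job script with metric collection switched off, assuming the option can be set as an #SBATCH directive like other Slurm options; runtime and application are placeholders:

#!/bin/bash
#SBATCH --time=00:30:00          # placeholder runtime
#SBATCH --collectors=off         # disable the metric collectors, e.g., when using your own profiler

srun ./my_profiled_application   # placeholder for your own profiling run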

Please don’t use this option by default. The metrics will help us define the requirements for a new HPC system in future procurements. Indirectly, they also help you by enabling us to provide you with a suitable HPC system for your workload.

If you do not want your jobs to be listed in ClusterCockpit, you can disable the submission of the job metadata. In that case, the metric collectors on the node keep running, but no metadata is submitted to ClusterCockpit. Simply add --clustercockpit=off to your job script, as shown in the sketch below.

--clustercockpit=[mode] Controls if job meta data gets submitted to the job-specific monitoring (ClusterCockpit, https://jobmon.pc2.uni-paderborn.de). mode=[off,normal (default)]
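A minimal sketch of a job script that keeps the collectors running but does not submit metadata to ClusterCockpit, again assuming the option can be set as an #SBATCH directive; runtime and application are placeholders:

#!/bin/bash
#SBATCH --time=00:30:00        # placeholder runtime
#SBATCH --clustercockpit=off   # collectors keep running, but no job metadata is submitted

srun ./my_application          # placeholder for your actual program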