Job-Monitoring
Introduction
We provide a job-oriented monitoring framework (ClusterCockpit) that can help you analyze performance issues of your jobs. The system is still in a testing phase, which means that some functions and visualizations may change over time.
How to get access?
We provide this system to all of our users. Simply go to https://jobmon.pc2.uni-paderborn.de and log in with your usual user credentials.
The web page can be accessed either via VPN or on-site at Paderborn University.
Quick start
After login, you will find a list of available cluster systems.
You can see how many running jobs are currently recognized by ClusterCockpit as well as the total number of your jobs. Click either on “Running jobs” or “Total jobs” to get a brief overview of the monitored jobs. Freshly launched jobs can take up to 5 minutes to appear in ClusterCockpit.
By clicking on the appropriate buttons, you can sort the list by different criteria, show or hide metrics, or apply filters to the list. You can also switch the cluster from Noctua 2 to Noctua 1 and vice versa. Click on the job ID in the upper left corner of a row to display all details of that job.
In the upper left corner of the job page, you will find some metadata of the job. In the middle and on the right, different diagrams give a quick impression of the job’s characteristics. Below those diagrams, each recorded metric is plotted at different granularities. At the bottom of the page, you will find additional metadata of the job.
A note on the metrics on these pages
The metric collectors do not know anything about Slurm jobs. They measure metrics at core, socket, or node level. Therefore, meaningful values can generally only be obtained if the job runs exclusively on its nodes.
This applies in particular to metrics such as cpu_load, mem_used, mem_bw, the Infiniband (ib_*) metrics, and the Lustre metrics. These metrics refer to the whole host, or at least a whole socket, and may therefore cover more than the selected job.
So if you don’t want to live with these caveats, you should run your job exclusively, as in the sketch below.
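A minimal sketch of a job script that requests exclusive node access, so that the node-level metrics refer only to your job. The Slurm option --exclusive is standard; the resource values and application name are placeholders you should adapt:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128   # adapt to the node type of the cluster you use
#SBATCH --time=01:00:00
#SBATCH --exclusive             # no other jobs share the nodes, so node-level
                                # metrics (cpu_load, mem_used, mem_bw, ...)
                                # belong to this job alone

srun ./my_application           # placeholder application
```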
Metrics
| Metric name | Meaning | Meaningful for shared jobs |
|---|---|---|
| mem_used | main memory used on the node | no; for shared jobs, have a look at the command-line tool |
| flops_any | floating-point operations performed by the CPU cores of the job, value = (FLOPS in FP64) + 0.5 * (FLOPS in FP32) | yes |
| mem_bw | memory bandwidth | no, because only the memory bandwidth of a full socket can be monitored |
| clock | frequency of the CPU cores of the job | yes, but other jobs on the node might affect the CPU frequency |
| pkg_pwr | package power of the CPU sockets | no |
| ipc | instructions per cycle | yes |
| ib_recv | Infiniband/Omni-Path receive bandwidth | no |
| ib_xmit | Infiniband/Omni-Path transmit bandwidth | no |
| ib_recv_pkts | Infiniband/Omni-Path received packets | no |
| ib_xmit_pkts | Infiniband/Omni-Path transmitted packets | no |
| cpu_load_core | load on the CPU cores of a job, i.e., the number of processes requesting CPU time on a specific CPU core | yes |
| cpu_user | percentage of CPU time spent as user time for each CPU core | yes |
| lustre_open | Lustre (PFS) file open requests | no |
| lustre_statfs | Lustre (PFS) file stat requests | no |
| lustre_close | Lustre (PFS) file close requests | no |
| cpu_load | load on the node, i.e., the number of processes requesting CPU time | no |
| lustre_read_bw | Lustre (PFS) read bandwidth | no |
| lustre_write_bw | Lustre (PFS) write bandwidth | no |
| nv_util | GPU compute utilization | yes |
| nv_mem_util | usage of the GPU memory in percent | yes |
| nv_fb_mem_used | usage of the GPU memory in GB | yes |
| nv_sm_clock | clock frequency of the GPU SMs | yes |
| nv_compute_processes | number of processes using the GPU | yes |
| nv_power_usage | power usage of the GPU | yes |
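To illustrate the flops_any formula from the table (with made-up numbers): a job sustaining 100 GFLOP/s in FP64 and 200 GFLOP/s in FP32 is reported as 100 + 0.5 × 200 = 200 GFLOP/s of flops_any.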
Controlling the metric collector and job submission
There might be situations in which you don’t want the metric collector to run on a compute node, for example, if you want to use your own profiler.
You can switch off metric collection by adding --collectors=off to your job script.
--collectors=[mode] Sets the mode of the collectors for job-specific
monitoring on the compute nodes. mode=[off,normal
(default)]
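For example, a minimal job script that disables the collectors might look like this (resource values, profiler, and application name are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --collectors=off        # switch off job-specific metric collection

# run your own profiler instead (placeholder command)
srun my_profiler ./my_application
```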
Please don’t use this option by default. The metrics help us define the requirements for a new HPC system in future procurements. Indirectly, they also help you, because they enable us to provide an HPC system suited to your workload.
If you don’t want your jobs to be collected in ClusterCockpit, you can disable the submission of their metadata. In that case, the metric collectors on the nodes keep running, but no metadata gets submitted to ClusterCockpit. Simply add --clustercockpit=off to your job script.
--clustercockpit=[mode] Controls if job meta data gets submitted to the
job-specific monitoring (ClusterCockpit,
https://jobmon.pc2.uni-paderborn.de).
mode=[off,normal (default)]
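A corresponding sketch (again with placeholder resource values and application name):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --clustercockpit=off    # collectors keep running on the nodes,
                                # but no job metadata is sent to ClusterCockpit

srun ./my_application           # placeholder application
```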