Issue Tags

The background Job-Monitoring on our clusters profiles all jobs and records performance metrics.
We process these metrics automatically and recognize behavior that might indicate performance or efficiency issues.
See the following list for issues that are currently detected, how they are recognized and possible reasons and solutions for the problems.
Note: the rules and thresholds are recurrently adjusted. Older jobs might have been analyzed using older rules/thresholds and the tagging might change once we reanalyze older jobs.

General advice

Check your applications documentations on HPC specific advice, such as parallelization and optimization.
Check the Software specific documentation in our wiki. It may contain hints on proper execution on our systems.
While a job is running, you can connect to the compute node via SSH. Now you can check the proper execution, e.g. using htop to inspect CPU utilization.
Carefully inspect the performance metrics for your job in the Job-Monitoring portal. Maybe you can spot unintended behavior or misconfigurations.
Details about the performance metrics and their interpretation you can find in the HPC Wiki:
- https://hpc-wiki.info/hpc/Performance_metrics
- https://hpc-wiki.info/hpc/Job_efficiency

Short job

The job duration is quite short.
Threshold: 60 seconds of job duration

Why this may be an issue

Many short jobs add additional stress to the scheduling system and lead to slower scheduling decisions.
Additionally, every job comes with an overhead due to node preparation and startup.

Possible reasons

Your job had an error and aborted before finishing. Check your job for errors. If the job is not recognized as failed, check if your job script returns a non-zero return value in case of failure.
Your job workload is too small. Define a larger workload by, for example, merging multiple jobs.

Low CPU load

The job does not run enough computations for the allocated CPU cores.
Threshold: 0.2 per core of cpu_load_core

Why this may be an issue

Allocated CPU cores that do not compute anything are basically wasted for your compute time and also for other users that could have used them.
Your workload could also be finished faster if you used the idling resources.

Possible reasons

You allocated more cores than used by your application.
- Check if your application needs that many cores.
Your application is blocked by waiting for other resources.
- Check if you can mitigate the blocking or hide the waiting time with other computations.
Your application is wrongly configured to use less cores than available.
- Check your application configuration, e.g. number of ranks or threads in OpenMP/MPI/Slurm configuration.

Low GPU load

The job does not run enough computations for the allocated GPUs.
Threshold: 0.4 per GPU of nv_util

Why this may be an issue

Allocated GPUs that do not compute anything are basically wasted for your compute time and also for other users that could have used them.
Your workload could also be finished faster if you used the idling resources.

Possible reasons

You allocated more GPUs than used by your application.
- Check if your application needs that many GPUs.
Your application is blocked by waiting for other resources.
- Check if you can mitigate the blocking or hide the waiting time with other computations.
Your application is wrongly configured to use less cores than available.
- Check your application configuration.
Your job only uses the GPU in a small fraction of its complete runtime.
- Check if you can split the workloads to isolate and batch the GPU work.
- Maybe your GPU workload can be executed entirely on CPU and your jobs could make use of the large number of CPU cores on the cluster.

GPU multi process

The job runs multiple processes that use the same GPU.
Threshold: 2 process per GPU of nv_compute_processes

Why this may be an issue

The computations of the processes could be executed on multiple GPUs in parallel to reduce job runtime.
Although, your application could also let multiple processes use the same GPU for efficiency reasons.

Possible reasons

Your job wrongly allocated less GPUs than intended.
- Check your job script to allocate multiple GPUs per job.
Your application is wrongly detecting only one GPU.
- Check your application's configuration to make use of multiple GPUs.

CPU high core

The job runs more processes/threads in parallel than there are allocated cores.
Threshold: 1.1 per core of cpu_load_core

Why this may be an issue

Multiple threads/processes can't run in parallel on one core, they have to alternate.
Computations can run faster if they are executed in parallel on multiple cores.
Additionally, there is an overhead in switching between multiple processes/threads that reduces efficiency.

Possible reasons

Your application is wrongly configured to use more processes/threads than cores available.
- Check your application configuration, e.g. number of ranks or threads in OpenMP/MPI/Slurm configuration.
You allocated less cores than intended.
- Check your job allocation.

CPU load node imbalance

Your job distributes work unevenly between multiple nodes.
Threshold: 0.1 difference in cpu_load_core between nodes

Why this may be an issue

Unevenly distributed workload means that one node waits idling for a different node to finish its workload.
More even work distribution may reduce the total runtime of the job.

Possible reasons

The unit of distribution is too large.
- Reducing the size of work items might lead to more evenly distributed workload.
You are not making use of the allocated nodes/CPU cores.
- See the Low CPU load issue.

CPU load core imbalance

Your job distributes work unevenly between multiple cores.
Threshold: 0.1 difference in cpu_load_core between cores

Why this may be an issue

Unevenly distributed workload means that one core waits idling for a different node to finish its workload.
More even work distribution may reduce the total runtime of the job.

Possible reasons

The unit of distribution is too large.
- Reducing the size of work items might lead to more evenly distributed workload.
You are not making use of the allocated nodes/CPU cores.
- See the Low CPU load issue.

Lustre high file number/Lustre high bandwidth

Your job uses files very often or reads/writes a lot of data from/to the parallel filesystem (PFS).
Threshold: 200 requests/s in lustre_open+lustre_statfs or 100 MB/s in lustre_read_bw or lustre_write_bw

Why this may be an issue

File operations come with a high overhead and may slowdown your computation.
How large the overhead is, depends on the interval of file operations, the size of the data and the location of the files (main memory-backed filesystem, parallel filesystem, etc.).

Possible reasons

Your application re-reads or overwrites the same files repeatedly.
- Check if your application can cache the read data or write the results only once in the end.
The parallel filesystem is used.
- The parallel filesystem connects all compute nodes to the storage over the high-bandwidth network.
  However, it is performant in working with large chunks from few files, but is sensitive to working with small chunks from many files.
  If you can't avoid the operations, one possible solution might be to copy the data first to node local (main memory-backed) filesystem (/dev/shm) and using the copied files.
  The same can be done with re-written result files: write them to the /dev/shm and at end of the job copy them to the parallel filesystem.
Your application writes extensive log files.
- Repeated writing during your computations might block your computation, because of the slow file operations.
  You can turn off unnecessary logging or hide the I/O operation latency with computations.