The background Job-Monitoring on our clusters profiles all jobs and records performance metrics.
We process these metrics automatically and flag behavior that might indicate performance or efficiency issues.
The following list describes the issues that are currently detected, how they are recognized, and possible reasons and solutions.
Note: The rules and thresholds are adjusted regularly. Older jobs may have been analyzed with earlier rules or thresholds, and their tagging may change once we reanalyze them.

General advice

  • Check your application's documentation for HPC-specific advice, such as parallelization and optimization.

  • Check the software-specific documentation in our wiki. It may contain hints on proper execution on our systems.

  • While a job is running, you can connect to the compute node via SSH. There you can check that the job executes as intended, e.g. by using htop to inspect CPU utilization.

  • Carefully inspect the performance metrics for your job in the Job-Monitoring portal. You may be able to spot unintended behavior or misconfigurations.

  • You can find details about the performance metrics and their interpretation in the HPC Wiki:

...

The job duration is quite short.
Threshold: 600 60 seconds of job duration

Why this may be an issue

Many short jobs put additional stress on the scheduling system and slow down scheduling decisions.
Additionally, every job comes with an overhead due to node preparation and startup.

...

  • Your job had an error and aborted before finishing. Check your job for errors. If the job is not recognized as failed, check whether your job script returns a non-zero exit code in case of failure.

  • Your job workload is too small. Define a larger workload by, for example, merging multiple jobs.
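
Both points above can be combined in a small driver script. The following is a minimal sketch, assuming the work items are independent commands; the function name `run_tasks` and the example commands are hypothetical and not part of our tooling. It runs several small work items inside a single job and derives an exit code that is non-zero if any of them failed.

```python
import subprocess
import sys


def run_tasks(commands):
    """Run each command in turn and return the number of failed commands."""
    failures = 0
    for cmd in commands:
        # subprocess.run waits for the command; a non-zero returncode means failure.
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"task failed with exit code {result.returncode}: {cmd}",
                  file=sys.stderr)
            failures += 1
    return failures


# Hypothetical work items that would otherwise be submitted as separate short jobs.
tasks = [[sys.executable, "-c", f"print('processing chunk {i}')"] for i in range(3)]
failed = run_tasks(tasks)

# Exit code for the whole job: non-zero as soon as one task failed.
exit_code = 1 if failed else 0
print(f"{failed} task(s) failed, job exit code would be {exit_code}")
```

In a real job script, the last step would be `sys.exit(exit_code)`, so that the failure is visible to the scheduler.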

...

The job does not run enough computations for the allocated CPU cores.
Threshold: 0.9 2 per core of cpu_load_core

Why this may be an issue

Allocated CPU cores that do not compute anything are wasted: they count toward your compute time and are unavailable to other users who could have used them.
Your workload could also finish faster if it used the idle resources.

...

The job does not run enough computations for the allocated GPUs.
Threshold: 0.7 4 per GPU of nv_util

Why this may be an issue

Allocated GPUs that do not compute anything are wasted: they count toward your compute time and are unavailable to other users who could have used them.
Your workload could also finish faster if it used the idle resources.

...

The job runs multiple processes that use the same GPU.
Threshold: 2 processes per GPU of nv_compute_processes

Why this may be an issue

The computations of these processes could be executed on multiple GPUs in parallel to reduce the job runtime.
However, your application may deliberately let multiple processes share one GPU for efficiency reasons.
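
If sharing a GPU is not intended, each process can be restricted to its own device. The following is a minimal sketch, assuming that the job's GPUs are listed in the `CUDA_VISIBLE_DEVICES` environment variable and that each process knows its node-local index; the helper names `pick_gpu` and `bind_process_to_gpu` are hypothetical.

```python
def pick_gpu(local_rank, visible_devices):
    """Map a node-local process index to one of the visible GPUs, round-robin."""
    devices = [d.strip() for d in visible_devices.split(",") if d.strip()]
    if not devices:
        raise ValueError("no GPUs visible")
    return devices[local_rank % len(devices)]


def bind_process_to_gpu(local_rank, environ):
    """Narrow CUDA_VISIBLE_DEVICES in the given environment to a single GPU."""
    environ["CUDA_VISIBLE_DEVICES"] = pick_gpu(
        local_rank, environ["CUDA_VISIBLE_DEVICES"]
    )
    return environ["CUDA_VISIBLE_DEVICES"]


# Hypothetical setup: 4 processes on a node with 4 visible GPUs.
for rank in range(4):
    env = {"CUDA_VISIBLE_DEVICES": "0,1,2,3"}  # stand-in for os.environ
    print(f"process {rank} -> GPU {bind_process_to_gpu(rank, env)}")
```

In an actual application, `os.environ` would be passed instead of the stand-in dictionary, and the call has to happen before the process initializes CUDA, because the variable is only read at that point.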

...

The job runs more processes/threads in parallel than there are allocated cores.
Threshold: 1.1 per core of cpu_load_core

Why this may be an issue

Multiple threads/processes cannot run in parallel on one core; they have to alternate.
Computations can run faster if they are executed in parallel on multiple cores.
Additionally, there is an overhead in switching between multiple processes/threads that reduces efficiency.
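
One way to avoid starting more workers than there are allocated cores is to derive the worker count from the cores the process is actually allowed to use. The following is a minimal sketch, assuming the scheduler restricts the CPU affinity of the job to the allocated cores (a property of the cluster configuration that is not confirmed here); `allocated_cores` and `worker_count` are hypothetical helper names.

```python
import os


def allocated_cores():
    """Return the number of CPU cores this process is allowed to run on."""
    try:
        # Linux only: respects the CPU affinity mask of the current process.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity. This counts all
        # cores of the machine, not only the allocated ones.
        return os.cpu_count() or 1


def worker_count(requested):
    """Cap a requested number of workers at the number of allocated cores."""
    return max(1, min(requested, allocated_cores()))


print(f"allocated cores: {allocated_cores()}, workers used: {worker_count(64)}")
```

The resulting number can then be passed to whatever controls the parallelism of the application, for example the size of a process pool or a thread-count setting.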

...

Your job distributes work unevenly between multiple nodes.
Threshold: 0.1 difference in cpu_load_core between nodes

Why this may be an issue

An unevenly distributed workload means that one node sits idle while waiting for another node to finish its workload.
A more even work distribution may reduce the total runtime of the job.
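
To illustrate how such a threshold can be read, the following sketch compares per-node load averages. It assumes that the "difference" is the gap between the highest and the lowest node average and that a job is flagged when this gap exceeds the threshold; the exact rule used by the Job-Monitoring may differ. The node names and load values are made up.

```python
def load_spread(loads):
    """Return the difference between the highest and the lowest load value."""
    if not loads:
        raise ValueError("need at least one load value")
    return max(loads) - min(loads)


def is_imbalanced(loads, threshold=0.1):
    """Report whether the spread of the load values exceeds the threshold."""
    return load_spread(loads) > threshold


# Hypothetical per-node averages of cpu_load_core.
node_loads = {"node01": 0.98, "node02": 0.95, "node03": 0.41}
print(is_imbalanced(list(node_loads.values())))  # prints True
```

Here node03 is far less busy than the other two nodes, so the job would be reported as imbalanced under these assumptions.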

...

Your job distributes work unevenly between multiple cores.
Threshold: 0.1 difference in cpu_load_core between cores

Why this may be an issue

An unevenly distributed workload means that one core sits idle while waiting for another core to finish its workload.
A more even work distribution may reduce the total runtime of the job.
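
A simple way to even out the work is to give every worker nearly the same number of items. The following is a minimal sketch, assuming that all work items cost roughly the same; if their costs differ strongly, a dynamic scheme in which workers fetch items one at a time may balance better. `split_evenly` is a hypothetical helper name.

```python
def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(items), parts)
    chunks = []
    start = 0
    for index in range(parts):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if index < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Ten work items distributed over four cores.
print(split_evenly(list(range(10)), 4))
# prints [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```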

...

Your job accesses files very frequently or reads/writes a lot of data from/to the parallel filesystem (PFS).
Threshold: 100 200 requests/s in lustre_open+lustre_statfs or 100 MB/s in lustre_read_bw or lustre_write_bw

Why this may be an issue

File operations come with a high overhead and may slow down your computation.
How large the overhead is depends on the frequency of the file operations, the size of the data, and the location of the files (main memory-backed filesystem, parallel filesystem, etc.).
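
The effect of the operation frequency can be illustrated by comparing two ways of writing the same data. The following is a minimal sketch with hypothetical function names and made-up records; it writes to a temporary directory rather than to the parallel filesystem. The first variant opens the file once per record, the second collects the records and opens the file only once.

```python
import os
import tempfile


def write_per_record(path, records):
    """Open and close the file for every record; returns the number of open calls."""
    for record in records:
        with open(path, "a") as handle:
            handle.write(record + "\n")
    return len(records)


def write_batched(path, records):
    """Open the file once and write all records together; returns the number of open calls."""
    with open(path, "w") as handle:
        handle.write("".join(record + "\n" for record in records))
    return 1


records = [f"result {i}" for i in range(1000)]
with tempfile.TemporaryDirectory() as tmp:
    per_record_path = os.path.join(tmp, "per_record.txt")
    batched_path = os.path.join(tmp, "batched.txt")
    print("open calls, per record:", write_per_record(per_record_path, records))
    print("open calls, batched:", write_batched(batched_path, records))
```

Both variants produce the same file content, but the batched variant issues far fewer open requests, which is the kind of request counted by the metrics above.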

...