Summary

The PC2 cluster systems use SLURM as their scheduler/workload manager.
Sharing of compute nodes among jobs is enabled.
CPU cores, memory and GPUs are allocated exclusively to jobs.
SMT (a.k.a. hyperthreading) is disabled on the compute nodes.
Start your MPI jobs with srun and not with mpirun or mpiexec.
You only see your own compute jobs. If you are the project administrator of a compute-time project, you see all jobs of the project.
Extra tools: scluster, spredict, squeue_pretty

1 Summary
2 Contents
3 Cluster and Usage Overview
- 3.1 scluster
- 3.2 pc2status
4 Jobs
- 4.1 Limitations
- 4.2 Batch Jobs
  - 4.2.1 Submitting Batch Jobs
  - 4.2.2 Monitor the State of Your Job
  - 4.2.3 Stopping Batch Jobs
  - 4.2.4 Running Jobs with Parallel Calculations, i.e. MPI
- 4.3 Interactive Jobs
- 4.4 Using GPUs
- 4.5 Using GPUs for Development and Testing Purposes
- 4.6 Using Nodes Exclusively
- 4.7 Setting your Compute-Time Project

If you prefer to learn about job submission in a course with hands-on sessions, please consider our HPC-Introduction courses.

Cluster and Usage Overview

On the PC2 cluster systems you can only see your own compute jobs, not the compute jobs of other users. This is configured for data privacy reasons. If you are the project administrator of a compute-time project, you can see all jobs of that project.

scluster

To get an idea of how busy the cluster is, you can use the command-line tool scluster. It gives you an overview of the current utilization of the cluster in terms of total, used, idle and unavailable nodes for each partition.

pc2status

pc2status is a command-line tool to get the following information:

file system quota of your home directory
your personal compute time usage in the express-QoS
your membership in compute-time projects
per project that you are a member of:
- start, end, granted resources
- usage and usage contingents
- current priority level
- file system quota and usage

Jobs

You can either run batch jobs or interactive jobs.

Limitations

We have limited the number of jobs submitted per user, account and partition combination to 50,000 concurrent jobs. In general, this should be sufficient. If you need more, please send an email to pc2-support@uni-paderborn.de with a brief explanation.
The size of a job array is limited to 20,000 jobs.
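
As a rough sketch of a job array that stays well below this limit, the following job script requests 100 array elements with the sbatch option --array. The job name and the file naming scheme input_N.dat are hypothetical and only serve as an illustration. SLURM sets SLURM_ARRAY_TASK_ID for each array element; the fallback to 0 only allows the script to be tried outside of SLURM.

```bash
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH -n 1
#SBATCH -J array_example
#SBATCH --array=0-99

# SLURM sets SLURM_ARRAY_TASK_ID to 0..99, one value per array element.
# Outside of SLURM the variable is unset, so fall back to 0 for a dry run.
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}

# Hypothetical naming scheme: each array element works on its own input file.
INPUT_FILE="input_${TASK_ID}.dat"
echo "Array task ${TASK_ID} would process ${INPUT_FILE}"
```

Each array element counts as a job, so large arrays also count towards the limit on concurrent jobs mentioned above.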

Batch Jobs

Diagram that shows a jobscript submitted to the workload manager

For a batch job you write a bash script that contains the commands that you want to execute in the job, e.g.

#!/bin/bash

# Place the actual commands you need to run your job here. 
# Below is an example of a simple "Hello World" and the name of the machine where the script runs.
echo "Hello world, I am running on node $HOSTNAME"

Let's name the script hello_world_job.sh.

It is recommended to include the parameters for the job in the job script. For SLURM these lines start with #SBATCH. Because they are comments, i.e. they start with #, they are ignored by the normal bash shell but read and interpreted by SLURM.

Important parameters are

Line	Mandatory	Meaning
`#SBATCH -t TIMELIMIT`	YES	specify the time limit of your job. Acceptable time formats for `TIMELIMIT` include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
`#SBATCH -N NODES`	no, default is 1	use at least `NODES` nodes for the job
`#SBATCH -n NTASKS`	no, default is 1	run `NTASKS` tasks in your job. A task is usually an MPI rank.
`#SBATCH --cpus-per-task=NCPUS`	no	run each task with `NCPUS` processors.
`#SBATCH --mem-per-cpu MEM`	no, default is memory-per-node/number-of-cores	memory per allocated cpu core, e.g. 1000M or 2G for 1000 MB or 2 GB respectively.
`#SBATCH --mem MEM`	no	memory per node
`#SBATCH -J NAME`	no, default is the file name of the job script	specify the `NAME` of the compute job
`#SBATCH -p PARTITION`	no, default is the normal partition	Submit jobs to the `PARTITION` partition. For a description of the partitions see Node Types and Partitions
`#SBATCH -A PROJECT`	no, if you are a member of only one compute-time project	specify the compute-time project `PROJECT` to use with this job
`#SBATCH -q QOS`	no, default is the default QoS of your project	Use the QoS `QOS`. For a description of QoS see Quality-of-Service (QoS) and Job Priorities
`#SBATCH --mail-type MAILTYPE`	no, default value is `NONE`	specify at which event you want a mail notification. `MAILTYPE` can be NONE, BEGIN, END, FAIL, REQUEUE, ALL.
`#SBATCH --mail-user MAILADDRESS`	no	specify the e-mail address that should receive the mail notifications
`#SBATCH --kill-on-bad-exit`	no, default value is 1	Kill the entire job if one task fails. Possible values are 0 or 1.

Overall, a practical example submit_great_calculation.sh could look like:

#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -N 2
#SBATCH -n 10
#SBATCH -J "great calculation"
#SBATCH -p normal

# Load any necessary modules for your job.
# Below is an example to load LIKWID (a tool for performance monitoring and benchmarking).
module reset
module load tools
module load likwid

# Run your application here.
# Below is an example that runs the "Hello World" script from above.
bash hello_world_job.sh

Many more options can be found in the man page of sbatch at Slurm Workload Manager - sbatch or by running man sbatch on the command line on the cluster.

Submitting Batch Jobs

Once you have a job script, e.g. submit_great_calculation.sh, you can submit it to the workload manager with the command

sbatch submit_great_calculation.sh

If everything went well, it will return a job id which is a unique integer number that identifies your job. This means that your job is now queued for execution.
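
If you want to use the job id in a script, you can extract it from the message that sbatch prints. The following is a minimal sketch, assuming the usual message of the form "Submitted batch job JOBID"; the function name extract_job_id is hypothetical. Alternatively, sbatch --parsable prints the job id without the surrounding text.

```bash
# Extract the numeric job id from the message that sbatch prints on success,
# e.g. "Submitted batch job 1234567". Prints nothing if the message does not match.
extract_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}

# Usage on the cluster (not executed here):
#   JOBID=$(extract_job_id "$(sbatch submit_great_calculation.sh)")
#   echo "Submitted job ${JOBID}"
```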

To monitor the state of your jobs please have a look at the next section.

Monitor the State of Your Job

The main tool to monitor the state of your jobs is squeue. It gives you a list of your current pending and running jobs. We recommend squeue_pretty, which is a wrapper around squeue that shows more information in a nicer format. You can also use the command-line tool spredict to get an estimate of the start time of your pending jobs. Note that this is just an estimate and can change rapidly if other people submit jobs with a higher priority than yours or if running jobs finish early or are canceled. If no estimate is shown, SLURM hasn’t estimated the start time of your job yet. If no time is shown even several minutes after the job submission, then your job is simply not within the limited time window that SLURM uses for these estimates.

Stopping Batch Jobs

You can cancel jobs with the command scancel, i.e. by running scancel JOBID where JOBID is the id of the job you want to cancel. You can also cancel all your jobs with scancel -u USERNAME.
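
scancel can also filter the jobs to cancel. The following sketch wraps two such filters in small helper functions; the function names are hypothetical, while --user, --state and --name are regular scancel options.

```bash
# Cancel only the pending jobs of the current user; running jobs are left untouched.
cancel_pending_jobs() {
    scancel --user="$USER" --state=PENDING
}

# Cancel all jobs of the current user that have the given job name (as set with -J).
cancel_jobs_by_name() {
    scancel --user="$USER" --name="$1"
}

# Usage on the cluster (not executed here):
#   cancel_pending_jobs
#   cancel_jobs_by_name "great calculation"
```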

Running Jobs with Parallel Calculations, i.e. MPI

For parallel calculations with MPI we recommend using srun and not mpirun or mpiexec. For example, to start a job with 8 MPI ranks and 4 threads per rank, use

#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH -J "great calculation"

export OMP_NUM_THREADS=4
srun ...

The output will contain an overview of the pinning of your tasks to cpu cores like

cpu-bind=MASK - n2cn0506, task  0  0 [1794493]: mask |BBBB----|--------||--------|--------||--------|--------||--------|--------||||--------|--------||--------|--------||--------|--------||--------|--------|  set
cpu-bind=MASK - n2cn0506, task  1  1 [1794494]: mask |----BBBB|--------||--------|--------||--------|--------||--------|--------||||--------|--------||--------|--------||--------|--------||--------|--------|  set
cpu-bind=MASK - n2cn0506, task  2  2 [1794495]: mask |--------|BBBB----||--------|--------||--------|--------||--------|--------||||--------|--------||--------|--------||--------|--------||--------|--------|  set

This tells you for example that task 0 has been pinned to the first 4 cpu cores of the node n2cn0506, the second task to the next 4 and so on.

If you don't want this output, you can set

export SLURM_CPU_BIND=cores,quiet
export OMPI_MCA_hwloc_base_report_bindings=false

in your ~/.bashrc.
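
If you keep the pinning output, you can check it against the requested --cpus-per-task by counting the B characters in a cpu-bind line. The following is a small sketch under the assumption that the lines have the format shown above; the function name count_bound_cores is hypothetical.

```bash
# Count the cpu cores a task is bound to in one "cpu-bind=MASK" line.
# Each B in the mask stands for one core that the task may run on.
count_bound_cores() {
    local mask count
    mask=${1#*: mask }                                  # drop everything up to and including ": mask "
    count=$(printf '%s' "$mask" | tr -cd 'B' | wc -c)   # keep only the B characters and count them
    echo $((count))
}

# Usage with a (shortened) line from the job output:
#   count_bound_cores "cpu-bind=MASK - n2cn0506, task  0  0 [1794493]: mask |BBBB----|--------|  set"
```

For the job script above, each task should report 4 cores, matching --cpus-per-task=4.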

The recommendations for number of MPI ranks and threads for the individual clusters can be found in Noctua 1 and Noctua 2.

Interactive Jobs

In an interactive job you type the commands to execute yourself in real time. Interactive jobs are not recommended on an HPC cluster because you usually don't know beforehand when your job is going to be started. Details on interactive jobs can be found in Interactive Jobs.

Using GPUs

Using the GPUs in our clusters is easy. Simply specify the GRES --gres=gpu:TYPE:COUNT in your job request, e.g. add

#SBATCH --gres=gpu:a100:3

to request 3 NVIDIA A100 GPUs per node in a job.

Available GPU types depending on the cluster are:

Cluster	GRES type	GPU type	Topology
Noctua 1	a40	NVIDIA A40 48 GB	2 GPUs per Node
Noctua 2	a100	NVIDIA A100 40 GB with NVLINK	4 GPUs per Node
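
A complete GPU job script could look like the following sketch for Noctua 2. It assumes that SLURM exports CUDA_VISIBLE_DEVICES for the allocated GPUs, which is the usual behaviour but not stated on this page, and the job name is hypothetical. Depending on the cluster you may also have to select the GPU partition with -p; see Node Types and Partitions.

```bash
#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:a100:1
#SBATCH -J gpu_test
# Possibly also required: #SBATCH -p PARTITION (GPU partition of your cluster)

# Assumption: SLURM sets CUDA_VISIBLE_DEVICES to the GPUs allocated to this job.
# Outside of a GPU job the variable is usually unset, so "none" is reported.
GPU_LIST=${CUDA_VISIBLE_DEVICES:-none}
echo "GPUs visible to this job: ${GPU_LIST}"

# Show details of the allocated GPUs if the NVIDIA tools are available on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi
```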

Using GPUs for Development and Testing Purposes

In order to prioritize the use of the main GPU partition for production workloads on Noctua 2, we have set up a secondary, smaller GPU partition (NVIDIA DGX 8xA100 40 GB with NVLINK and NVSWITCH) exclusively for testing and development purposes. This GPU partition should only be used for these specific purposes rather than for production workloads. On Noctua 1, you can, of course, use the normal GPU-partition also for development and testing purposes.

You can run (interactive) jobs on that partition by specifying --qos=devel --partition=dgx --gres=gpu:a100:$NGPUs in your job request (replace $NGPUs with the number of GPUs you would like to use).

There are a few restrictions:

Depending on the number of GPUs you are requesting, the maximum time limit for your job changes:
1 GPU (+ 16 CPU-cores) for 240 minutes (04:00:00)
2 GPUs (+ 32 CPU-cores) for 210 minutes (03:30:00)
3 GPUs (+ 48 CPU-cores) for 180 minutes (03:00:00)
4 GPUs (+ 64 CPU-cores) for 150 minutes (02:30:00)
5 GPUs (+ 80 CPU-cores) for 120 minutes (02:00:00)
6 GPUs (+ 96 CPU-cores) for 90 minutes (01:30:00)
7 GPUs (+ 112 CPU-cores) for 60 minutes (01:00:00)
8 GPUs (+ 128 CPU-cores) for 30 minutes (00:30:00)
You can have only one active and one pending job in the SLURM queue.

Noctua 2: If you want to develop or test within interactive SLURM sessions, this bash script, which automatically calculates the time limit for you, may be useful:

#!/bin/bash

if [[ "$1" =~ ^[1-8]$ ]]; then  # accept only a GPU count between 1 and 8
    NGPU=$1
else
    echo "Defaulting to 1 GPU."
    NGPU=1
fi
TIME=$((270-30*$NGPU))
HOURS=$(($TIME/60))
MINUTES=$(($TIME%60))
CORES_PER_GPU=16
CORES=$(($CORES_PER_GPU * $NGPU))
printf "Requesting %d GPUs (+ %d CPU-cores) for %d minutes (%02d:%02d:00)\n\n" $NGPU $CORES $TIME $HOURS $MINUTES
srun -N 1 -n 1 -c $CORES --gres=gpu:a100:$NGPU --qos=devel -p dgx -t $TIME --pty bash

Noctua 1: If you want to develop or test within interactive SLURM sessions, this bash script, which sets the time limit and the number of cores for you, may be useful:

#!/bin/bash

if [[ "$1" =~ ^[1-2]$ ]]; then  # accept only a GPU count of 1 or 2
    NGPU=$1
else
    echo "Defaulting to 1 GPU."
    NGPU=1
fi
TIME=$((4*60))
HOURS=$(($TIME/60))
MINUTES=$(($TIME%60))
CORES_PER_GPU=20
CORES=$(($CORES_PER_GPU * $NGPU))
printf "Requesting %d GPUs (+ %d CPU-cores) for %d minutes (%02d:%02d:00)\n\n" $NGPU $CORES $TIME $HOURS $MINUTES
srun -N 1 -n 1 -c $CORES --gres=gpu:a40:$NGPU -p gpu -t $TIME --pty bash

Using Nodes Exclusively

Compute nodes on our clusters can be shared by multiple compute jobs. Please note that the requested number of cpu cores, memory, and GPUs are always allocated exclusively for a job. That means that if multiple jobs run on a compute node, they will not share the same cpu cores, memory or GPUs.

If you want to use a complete node exclusively for your job, i.e. you don’t want other people’s jobs to use cpu cores or memory that your job hasn’t allocated, then you can add

#SBATCH --exclusive

to your job script. Be aware that you then have to “pay” for the whole node with your compute-time contingent even if you didn’t allocate all cores or memory for your job.

Setting your Compute-Time Project

Compute-time projects are called accounts in SLURM. If you are a member of multiple compute-time projects on a cluster, then you have to specify the project to be used for a job. You can do this in the job submission either via

-A PROJECT as a command-line argument
or as #SBATCH -A PROJECT in a job script.

If you want to set a project as a default, you can put export SLURM_ACCOUNT=PROJECT and export SBATCH_ACCOUNT=PROJECT for example in your ~/.bashrc.

Replace PROJECT with the name of your compute project (e.g., hpc-prf-foo).
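
As a sketch, the corresponding lines in your ~/.bashrc could look as follows, using the example project name hpc-prf-foo from above. SLURM_ACCOUNT is read by srun and SBATCH_ACCOUNT by sbatch; SALLOC_ACCOUNT is an addition that is only relevant if you also use salloc.

```bash
# Default compute-time project for SLURM commands.
# hpc-prf-foo is only a placeholder, replace it with your own project name.
PROJECT=hpc-prf-foo

export SLURM_ACCOUNT="$PROJECT"    # default account for srun
export SBATCH_ACCOUNT="$PROJECT"   # default account for sbatch
export SALLOC_ACCOUNT="$PROJECT"   # default account for salloc (only needed if you use salloc)
```

An explicit -A PROJECT on the command line or in the job script still takes precedence over these defaults.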
