If you prefer to learn about job submission in a course with hands-on sessions, please consider our HPC-Introduction courses.

Summary

  • The PC2 cluster systems use SLURM as the scheduler/workload manager.

  • Sharing of compute nodes among jobs is enabled.

  • CPU cores, memory and GPUs are allocated exclusively to jobs.

  • SMT (a.k.a. hyper-threading) is disabled on the compute nodes.

  • Start your MPI jobs with srun and not mpirun or mpiexec.

  • You only see your own compute jobs. If you are the project administrator of a compute-time project, you see all jobs of the project.

  • Extra tools: scluster, spredict, squeue_pretty

Cluster and Usage Overview

On the PC2 cluster systems you can only see your own compute jobs, not the compute jobs of other users. This is configured for data privacy reasons. If you are the project administrator of a compute-time project, you can see all jobs of that project.

scluster

To get an idea of how busy the cluster is, you can use the command-line tool scluster. It gives you an overview of the current utilization of the cluster in terms of total, used, idle and unavailable nodes for each partition.

pc2status

pc2status is a command-line tool to get the following information:

  • file system quota of your home directory

  • your personal compute time usage in the express-QoS

  • your membership in compute-time projects

  • per project that you are a member of:

    • start, end, granted resources

    • usage and usage contingents

    • current priority level

    • file system quota and usage

Jobs

You can either run batch jobs or interactive jobs.

Batch Jobs

In a batch job you set up a bash script that contains the commands that you want to execute in the job, e.g.

Code Block
#!/bin/bash
echo "Hello World"

It is recommended to include the parameters for the job in the job script. For SLURM these lines start with #SBATCH. Because they are comments, i.e. they start with #, they are ignored by the normal bash shell but read and interpreted by SLURM.
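To illustrate which lines SLURM treats as parameters, the following sketch lists the #SBATCH directives of a job script. The helper name list_sbatch_directives is hypothetical and not a SLURM tool. It assumes the behaviour described in the sbatch man page, namely that directives are only read up to the first line that is neither a comment nor blank.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): print the #SBATCH options of a job
# script roughly the way sbatch reads them, i.e. only up to the first command.
list_sbatch_directives() {
    awk '
        /^#SBATCH/         { sub(/^#SBATCH[ \t]+/, ""); print; next }  # a directive
        /^#/ || /^[ \t]*$/ { next }                                     # other comment or blank line
                           { exit }                                     # first command: stop reading
    ' "$1"
}
```

Calling list_sbatch_directives great_calculation.sh prints one option per line. An #SBATCH line placed after the first command of the script would not be listed, which is a common reason for parameters being silently ignored.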

Important parameters are

| Line | Mandatory | Meaning |
|---|---|---|
| #SBATCH -t TIMELIMIT | YES | Specify the time limit of your job. Acceptable time formats for TIMELIMIT include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". |
| #SBATCH -N NODES | no, default is 1 | Use at least NODES nodes for the job. |
| #SBATCH -n NTASKS | no, default is 1 | Run NTASKS tasks in your job. A task is usually an MPI rank. |
| #SBATCH --mem-per-cpu MEM | no, default is memory-per-node/number-of-cores | Memory per allocated CPU core, e.g. 1000M or 2G for 1000 MB or 2 GB respectively. |
| #SBATCH -J NAME | no, default is the file name of the job script | Specify the NAME of the compute job. |
| #SBATCH -p PARTITION | no, default is the normal partition | Submit the job to the partition PARTITION. For a description of the partitions, see Node Types and Partitions. |
| #SBATCH -A PROJECT | no, if you are a member of only one compute-time project | Specify the compute-time project PROJECT to use with this job. |
| #SBATCH -q QOS | no, default is the default QoS of your project | Use the QoS QOS. For a description of the QoS, see Node Types and Partitions. |
| #SBATCH --mail-type MAILTYPE | no, default value is NONE | Specify at which event you want a mail notification. MAILTYPE can be NONE, BEGIN, END, FAIL, REQUEUE, ALL. |
| #SBATCH --mail-user MAILADDRESS | no | Specify the mail address that should receive the mail notifications. |
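The TIMELIMIT formats listed above can be illustrated with a small converter. This is a minimal sketch and not a SLURM tool: the function name timelimit_to_seconds is hypothetical, and it only handles the six formats named in the table.

```shell
#!/bin/bash
# Hypothetical helper: convert a SLURM time limit string to seconds.
# Supported: minutes, minutes:seconds, hours:minutes:seconds,
#            days-hours, days-hours:minutes, days-hours:minutes:seconds
timelimit_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0 rest a b c
    if [[ $spec == *-* ]]; then
        # Formats with a day part: everything after the dash starts with hours.
        days=${spec%%-*}
        rest=${spec#*-}
        IFS=: read -r hours minutes seconds <<< "$rest"
    else
        # Formats without a day part: the meaning depends on the number of fields.
        IFS=: read -r a b c <<< "$spec"
        if [[ -n $c ]]; then
            hours=$a; minutes=$b; seconds=$c
        elif [[ -n $b ]]; then
            minutes=$a; seconds=$b
        else
            minutes=$a
        fi
    fi
    # The 10# prefix forces base 10 so that values like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#${hours:-0} * 3600 + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
}

timelimit_to_seconds 2:00:00   # hours:minutes:seconds
timelimit_to_seconds 1-12      # days-hours
```

Note that a single number is interpreted as minutes, so -t 30 requests 30 minutes, while 90:30 means 90 minutes and 30 seconds.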

Overall, a practical example great_calculation.sh could look like:

Code Block
#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -N 2
#SBATCH -n 10
#SBATCH -J "great calculation"
#SBATCH -p normal

#run your application here

Many more options can be found in the man page of sbatch at https://slurm.schedmd.com/sbatch.html or by running man sbatch on the command line on the cluster.

Submitting Batch Jobs

Once you have a job script, e.g. great_calculation.sh, you can submit it to the workload manager with the command

sbatch great_calculation.sh

If everything went well, it will return a job id which is a unique integer number that identifies your job. This means that your job is now queued for execution.
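If you want to reuse the job id in a script, it can be extracted from the message that sbatch prints. The following is a minimal sketch: the helper name job_id_from_sbatch_output is hypothetical, and it assumes the default message format "Submitted batch job <id>". Alternatively, sbatch --parsable prints the job id without the surrounding text.

```shell
#!/bin/bash
# Hypothetical helper: read the default sbatch message ("Submitted batch job <id>")
# from standard input and print only the job id.
job_id_from_sbatch_output() {
    awk '/^Submitted batch job / { print $4 }'
}

# On the cluster this could be used as:
#   jobid=$(sbatch great_calculation.sh | job_id_from_sbatch_output)
#   echo "queued as job ${jobid}"
```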

To monitor the state of your jobs please have a look at https://uni-paderborn.atlassian.net/wiki/spaces/PC2DOK/pages/12944358/How+to+Monitor+Your+Jobs.

Monitor the State of Your Job

The main tool to monitor the state of your jobs is squeue. It gives you a list of your currently pending and running jobs. We recommend squeue_pretty, which is a wrapper around squeue that gives you more information in a nicer format. You can also use the command-line tool spredict to get an estimate of the starting time of your pending jobs. Note that this is just an estimate and can change rapidly if other people submit jobs with a higher priority than yours or if running jobs finish early or are canceled. If no estimate is shown, SLURM hasn't estimated the start time of your job yet. If no time is shown even several minutes after the job submission, then your job is simply not in the limited time window that SLURM uses.

Stopping Batch Jobs

You can cancel a job with the command scancel, i.e. by running scancel JOBID where JOBID is the id of the job you want to cancel. You can also cancel all your jobs with scancel -u USERNAME.

Running Jobs with Parallel Calculations, i.e. MPI

For parallel calculations with MPI we recommend using srun and not mpirun or mpiexec. For example, to start a job with 8 MPI ranks and 4 threads per rank, use

Code Block
#!/bin/bash
#SBATCH -t 2:00:00
#SBATCH -n 8
#SBATCH --cpus-per-task=4
#SBATCH -J "great calculation"

export OMP_NUM_THREADS=4
srun ...

The output will contain an overview on the pinning of your tasks to cpu cores like

Code Block
cpu-bind=MASK - n2cn0167, task  0  0 [2169499]: mask 0xf set
cpu-bind=MASK - n2cn0167, task  1  1 [2169500]: mask 0xf0 set
cpu-bind=MASK - n2cn0167, task  2  2 [2169501]: mask 0xf00 set
cpu-bind=MASK - n2cn0167, task  3  3 [2169502]: mask 0xf000 set
cpu-bind=MASK - n2cn0167, task  4  4 [2169503]: mask 0xf0000 set
cpu-bind=MASK - n2cn0167, task  5  5 [2169504]: mask 0xf00000 set
cpu-bind=MASK - n2cn0167, task  6  6 [2169505]: mask 0xf000000 set
cpu-bind=MASK - n2cn0167, task  7  7 [2169506]: mask 0xf0000000 set

This tells you, for example, that task 0 has been pinned to the first 4 CPU cores of the node n2cn0167 (mask 0xf sets bits 0 to 3), the second task to the next 4 (mask 0xf0 sets bits 4 to 7) and so on. Please note: we are working on an improvement of the output so that, instead of hexadecimal numbers describing the CPU map, you will get a nice layout.
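Until then, a hexadecimal mask from this output can be decoded by hand or with a few lines of bash. The following is a minimal sketch: the function name mask_to_cores is hypothetical, and it assumes that bit i of the mask corresponds to CPU core i on the node.

```shell
#!/bin/bash
# Hypothetical helper: translate a cpu-bind mask such as 0xf0 into a list of core ids.
# The mask is processed one hex digit at a time, so it also works for masks
# that are longer than a 64-bit integer.
mask_to_cores() {
    local hex=${1#0x} cores=() i bit digit
    local len=${#hex}
    for (( i = 0; i < len; i++ )); do
        # Take the i-th hex digit counted from the right; it covers cores 4*i .. 4*i+3.
        digit=$(( 16#${hex:len-1-i:1} ))
        for (( bit = 0; bit < 4; bit++ )); do
            if (( (digit >> bit) & 1 )); then
                cores+=( $(( 4 * i + bit )) )
            fi
        done
    done
    echo "${cores[*]}"
}

mask_to_cores 0xf    # task 0 in the output above
mask_to_cores 0xf0   # task 1 in the output above
```

For the masks shown above this yields cores 0 to 3 for task 0 and cores 4 to 7 for task 1, matching the 4 threads per rank requested with --cpus-per-task=4.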

The recommendations for numbers of MPI ranks and threads for the individual clusters can be found in Noctua 1 and Noctua 2.
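When you change the number of threads per rank, the value of --cpus-per-task and OMP_NUM_THREADS have to be kept in sync. One way to avoid a mismatch is to derive the thread count from the environment of the job. The sketch below assumes that SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; the helper name threads_per_task is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: use the number of cores per task granted by SLURM as the
# OpenMP thread count. SLURM_CPUS_PER_TASK is only set when --cpus-per-task
# was requested, so fall back to 1 thread otherwise.
threads_per_task() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS=$(threads_per_task)
echo "running with ${OMP_NUM_THREADS} threads per MPI rank"
```

In a job script these lines would replace the hard-coded export OMP_NUM_THREADS=4 before the srun call.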

Interactive Jobs

In an interactive job you type the commands to execute yourself in real time. Interactive jobs are not recommended on an HPC cluster because you usually don't know beforehand when your job is going to be started. Details on interactive jobs can be found in Interactive Jobs.

Using GPUs

Using the GPUs in our clusters is easy. Simply specify the GRES --gres=gpu:TYPE:COUNT in your job request, e.g. add

Code Block
#SBATCH --gres=gpu:a100:3

to request 3 NVIDIA A100 GPUs per node in a job.

Available GPU types depending on the cluster are:

| Cluster | GRES type | GPU type |
|---|---|---|
| Noctua 1 | 1080ti | NVIDIA GTX 1080 TI 12 GB |
| Noctua 1 | 2080ti | NVIDIA RTX 2080 TI 12 GB |
| Noctua 2 | a100 | NVIDIA A100 40 GB with NVLINK |
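A complete GPU job script could look like the following sketch. It requests two A100 GPUs on one node and reports how many GPUs the job can see. The helper count_visible_gpus is hypothetical, and the script assumes that SLURM exports CUDA_VISIBLE_DEVICES for jobs with allocated GPUs, which depends on the cluster configuration.

```shell
#!/bin/bash
#SBATCH -t 0:30:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:a100:2
#SBATCH -J "gpu check"

# Hypothetical helper: count the GPUs listed in CUDA_VISIBLE_DEVICES,
# which holds a comma-separated list of device ids.
# Assumption: SLURM exports this variable for jobs that were allocated GPUs.
count_visible_gpus() {
    local devices=${CUDA_VISIBLE_DEVICES:-}
    if [[ -z $devices ]]; then
        echo 0
        return
    fi
    local -a ids
    IFS=, read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

Such a check at the start of a job makes it easy to spot a missing or mistyped --gres line before the actual application starts.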

Using Nodes Exclusively

Compute nodes on our clusters can be shared by multiple compute jobs. Please note that the requested CPU cores, memory, and GPUs are always allocated exclusively to a job. That means that if multiple jobs run on a compute node, they will not share the same CPU cores, memory or GPUs.

If you want to use a complete node exclusively for your job, i.e. you don't want other people's jobs to use CPU cores or memory that your job hasn't allocated, then you can add

#SBATCH --exclusive

to your job script. Be aware that you then have to “pay” for the whole node with your compute-time contingent, even if you didn't allocate all cores or memory for your job.
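The effect on your contingent can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes that usage is charged as the number of charged cores multiplied by the run time in hours, and the node size of 128 cores and the helper name core_hours are hypothetical. The actual accounting on the clusters may differ.

```shell
#!/bin/bash
# Illustrative arithmetic only. Assumption: usage is charged as
# (number of cores charged) x (run time in hours); the real accounting may differ.
core_hours() {
    local cores=$1 hours=$2
    echo $(( cores * hours ))
}

cores_per_node=128   # hypothetical node size
requested_cores=16   # cores the job actually asked for
runtime_hours=2

echo "shared:    $(core_hours "$requested_cores" "$runtime_hours") core-hours"
echo "exclusive: $(core_hours "$cores_per_node" "$runtime_hours") core-hours"
```

Under these assumptions the same 2-hour job costs 32 core-hours on a shared node and 256 core-hours with --exclusive.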