PC2-Documentation

Current Issues and Upcoming System Events

System

Event

Noctua 2 GPU-Nodes: November 13th - December 4th

A100 GPUs have been reserved for a large compute-time project that needs a significant number of GPUs continuously, so fewer GPUs are available to other projects during this period. Please also consider the A100 GPUs in the DGX and the A40 GPUs in Noctua 1. We are happy to support you in moving jobs to the Noctua 1 GPU nodes; please let us know via pc2-support@uni-paderborn.de.

Noctua 2 Compute Nodes: May 8th - ?

Information/Warning: Some multi-node calculations fail with error messages such as
[n2cn...] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
These calculations seem to trigger an instability in the InfiniBand connection between the nodes and the first-level InfiniBand switches.
We are investigating the issue. For the time being, if this happens to you, please notify us and include the job IDs of the failed jobs. Then restart the calculation, excluding the nodes it ran on (sbatch -x NODENAMES...). You can find the node names with sacct -p -j JOBID -o nodelist or in the output of the failed job.
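The workaround above can be sketched as a small shell helper. This is only an illustration, not an official PC2 tool: the function name failed_nodes is made up here, and JOBID and job.sh are placeholders for your own job ID and job script. It assumes that sacct -p prints a "NodeList|" header followed by one "|"-terminated node list per job step.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or the PC2 tooling):
# reads the output of `sacct -p -j JOBID -o nodelist` on stdin and
# prints a comma-separated, de-duplicated node list for `sbatch -x`.
failed_nodes() {
    # 1. strip the trailing "|" that -p appends to every line
    # 2. drop empty lines, the "NodeList" header, and "None assigned"
    # 3. de-duplicate (byte-wise order) and join with commas
    sed 's/|$//' \
        | grep -v -e '^$' -e '^NodeList$' -e '^None assigned$' \
        | LC_ALL=C sort -u \
        | paste -sd, -
}

# Usage on the cluster (JOBID and job.sh are placeholders):
#   nodes=$(sacct -p -j JOBID -o nodelist | failed_nodes)
#   sbatch -x "$nodes" job.sh
```

Slurm accepts bracketed host ranges in the exclude list, so entries taken from sacct can be passed on unchanged. If the helper prints nothing, take the node names from the output of the failed job instead.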

System status

Noctua 1

Login Nodes

LNet Router to Pling 3

Job Submission

Compute Nodes

Parallel Filesystem

GPU Nodes

CIFS / NFSv4 Export of PFS

JupyterHub

Noctua 2

Login Nodes

Compute Nodes

Job Submission

FPGA Nodes

Parallel Filesystem

GPU Nodes

CIFS / NFSv4 Export of PFS

JupyterHub

Central services

Jobmonitoring

Central fileserver (Home- and Group-Directories)

Authentication services

Maintenance is planned