Noctua 2 GPU-Nodes: November 13th - December 4th
Due to a reservation of A100 GPUs for a large compute-time project that continuously needs a significant number of GPUs, fewer A100 GPUs are available for other projects during this time period. Please also consider using the A100 GPUs in the DGX system as well as the A40 GPUs in Noctua 1. We would be happy to support you in moving some jobs to the Noctua 1 GPU nodes; please let us know via firstname.lastname@example.org.
Noctua 2 Compute Nodes: May 8th - ?
Information/Warning: Some multi-node calculations are failing with error messages like
[n2cn...] ib_mlx5_log.c:171 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
These calculations seem to trigger an instability in the InfiniBand connection between the nodes and the first-level InfiniBand switches.
We are investigating the issue. For the time being, please notify us if this happens to you (include the job IDs of the failed jobs) and restart the calculation while excluding the nodes it previously ran on:
sbatch -x NODENAMES...
You can find the node names with
sacct -p -j JOBID -o nodelist
or by looking at the output of the failed job.
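The two steps above can be combined in a small shell snippet. This is only a sketch: the job ID, the job script name, and the sample sacct output below are hypothetical, and the sbatch call at the end requires a Slurm environment, so it is shown commented out.

```shell
# Hypothetical output of: sacct -p -j JOBID -o nodelist
# (pipe-separated, header line first; replace with the real command on the cluster)
sample='NodeList|
n2cn[0123,0456]|'

# Take the data line and strip the trailing pipe-separated field delimiter
nodes=$(printf '%s\n' "$sample" | tail -n 1 | cut -d'|' -f1)
echo "excluding nodes: $nodes"

# Resubmit the job while excluding the nodes the failed run used
# (jobscript.sh is a placeholder for your actual job script):
# sbatch -x "$nodes" jobscript.sh
```

On the cluster itself you would replace the sample variable with a call to sacct for the failed job ID and then run the sbatch line directly.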