Known issues with Intel MPI versions 2021.6.0, 2021.7.1, and 2021.9.0 on Noctua 2:
Collective routines may hang on Noctua 2 when using one of these versions. We are currently investigating the issue. As a workaround, use an older version, such as 2021.5.0:
module load mpi/impi/2021.5.0-intel-compilers-2022.0.1
Software that brings its own copy of Intel MPI may fail on startup with an error similar to:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2137)......: PMI_Init returned -1
Reason: Within compute jobs, we automatically set the
I_MPI_PMI_LIBRARY
environment variable to allow running MPI jobs with Intel MPI using srun. However, codes that bring their own copy of Intel MPI usually use a different startup mechanism (e.g., mpirun), which fails when this variable is set.
Solution: Unset the variable in your job script using
unset I_MPI_PMI_LIBRARY
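For example, a minimal job script could look like the following sketch (resource requests and the launcher script name are illustrative placeholders; adapt them to your application):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# The application ships its own Intel MPI, so remove the cluster-set
# PMI library variable before its mpirun-based launcher starts.
unset I_MPI_PMI_LIBRARY

# Hypothetical launcher that internally calls mpirun.
./run_my_app.sh
```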
.
When you can call your software with srun, you can resolve the same type of error more easily by adding
--mpi=pmi2
to the invocation, e.g.:
srun -n 3 --mpi=pmi2 ./mw_fpga
In an interactive SLURM job, running your application under
mpirun
/
mpiexec
might just hang (doing nothing indefinitely). To fix this, set
I_MPI_HYDRA_BOOTSTRAP=ssh
.
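An interactive session using this workaround could look like the following sketch (resource requests and the binary name are illustrative placeholders):

```shell
# Request an interactive allocation (adjust resources as needed).
srun --nodes=1 --ntasks=4 --time=00:30:00 --pty bash

# Inside the interactive shell: tell Intel MPI's Hydra launcher to
# bootstrap processes via ssh instead of the default mechanism.
export I_MPI_HYDRA_BOOTSTRAP=ssh
mpirun -n 4 ./my_application   # placeholder binary name
```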