Intel MPI Issues
Known issues with Intel MPI versions 2021.6, 2021.7, 2021.8 and 2021.9 on Noctua 2
Collective routines may hang on Noctua 2 when using one of these versions. These versions are included in oneAPI versions from 22.2.0 up to and including 23.1.0 and in the intel toolchains from 2022a up to and including 2023a. The issue may only show up for sufficiently large numbers of ranks.
Recommended solution: We recommend using either Intel MPI 2021.5 or older (as included in oneAPI 22.1.2 and older) or Intel MPI 2021.10 or newer (as included in oneAPI 23.2.0 and newer). If you are using one of the intel toolchains from our module tree, you can reload the mpi/impi module in one of the non-affected versions:
module load mpi/impi/2021.10.0-intel-compilers-2023.2.1
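A minimal job-script sketch (the toolchain module name, task count, and application are placeholders and may need to be adjusted to your setup):
#!/bin/bash
#SBATCH --ntasks=128
module load toolchain/intel/2023a                          # example toolchain module, adjust to your environment
module load mpi/impi/2021.10.0-intel-compilers-2023.2.1    # reload the non-affected Intel MPI
srun ./your_application                                    # placeholder application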
Alternative workaround (may have performance impact): The problem can also be worked around by setting the FI_PROVIDER environment variable in one of the following ways:
FI_PROVIDER=psm3
FI_PROVIDER=verbs
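As a sketch (the application name is a placeholder), the variable can simply be exported in the job script before launching the MPI program:
export FI_PROVIDER=psm3      # or FI_PROVIDER=verbs
srun ./your_application      # placeholder application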
Software that brings its own copy of Intel MPI may fail on startup with an error similar to:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1743)......: channel initialization failed
MPID_Init(2137)......: PMI_Init returned -1
Reason: Within compute jobs, we automatically set the I_MPI_PMI_LIBRARY environment variable to allow running MPI jobs using srun with Intel MPI. However, codes that bring their own copy of Intel MPI usually use a different startup mechanism (e.g., mpirun), which fails when this variable is set.
Solution: Unset the variable in your job script:
unset I_MPI_PMI_LIBRARY
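A minimal job-script sketch (task count and application name are placeholders) that unsets the variable before starting the software with its bundled mpirun:
#!/bin/bash
#SBATCH --ntasks=4
unset I_MPI_PMI_LIBRARY                  # avoid conflict with the bundled Intel MPI startup
mpirun -n 4 ./bundled_mpi_application    # placeholder for software shipping its own Intel MPI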
When you can call your software with srun, the same type of error can be resolved more easily by adding --mpi=pmi2 to the invocation, e.g.:
srun -n 3 --mpi=pmi2 ./mw_fpga
In an interactive SLURM job, running your application under mpirun/mpiexec might just hang (doing nothing indefinitely). To fix this, set:
I_MPI_HYDRA_BOOTSTRAP=ssh
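For example (a sketch; the task count and application name are placeholders), inside the interactive job:
export I_MPI_HYDRA_BOOTSTRAP=ssh    # make the Hydra process manager use ssh for startup
mpirun -n 4 ./your_application      # placeholder application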