Intel MPI Issues

  • Known issues with Intel MPI versions 2021.6, 2021.7, 2021.8, and 2021.9 on Noctua 2
    Collective routines may hang on Noctua 2 when using one of these versions. These versions are included in oneAPI versions from 22.2.0 up to and including 23.1.0 and in the intel toolchains from 2022a up to and including 2023a. The issue may only appear for sufficiently large numbers of ranks.

    • Recommended solution: We recommend using either Intel MPI 2021.5 or older (included in oneAPI 22.1.2 and older) or Intel MPI 2021.10 or newer (included in oneAPI 23.2.0 and newer). If you use one of the intel toolchains from our module tree, you can reload the mpi/impi module with one of the unaffected versions:
      module load mpi/impi/2021.10.0-intel-compilers-2023.2.1

    • Alternative workaround (may have a performance impact): The problem can also be worked around by setting the FI_PROVIDER environment variable to one of the following values:

      • FI_PROVIDER=psm3

      • FI_PROVIDER=verbs
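
      A minimal job-script sketch of this workaround (the application name my_mpi_app is hypothetical):

```shell
# Work around the collective hang by selecting a different libfabric
# provider. Note that this may reduce communication performance.
export FI_PROVIDER=verbs   # alternatively: export FI_PROVIDER=psm3

# Afterwards, launch the application as usual, e.g.:
# srun ./my_mpi_app
```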

  • Software that brings its own copy of Intel MPI may fail on startup with an error similar to:

    Fatal error in MPI_Init: Other MPI error, error stack:
    MPIR_Init_thread(805): fail failed
    MPID_Init(1743)......: channel initialization failed
    MPID_Init(2137)......: PMI_Init returned -1
    • Reason: Within compute jobs, we automatically set the I_MPI_PMI_LIBRARY environment variable to allow running MPI jobs using srun with Intel MPI. However, codes that bring their own copy of Intel MPI usually use a different startup mechanism (e.g., mpirun) which fails when this variable is set.

    • Solution: Unset the variable in your job script using unset I_MPI_PMI_LIBRARY.
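
      A minimal job-script sketch of this fix (the mpirun invocation and the application name bundled_app are hypothetical):

```shell
# Unset the variable that is set automatically inside compute jobs, so
# that the bundled Intel MPI's own startup mechanism (e.g., mpirun) works.
unset I_MPI_PMI_LIBRARY

# Then start the software as documented by its vendor, e.g.:
# mpirun -np 4 ./bundled_app
```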

    • When you can call your software with srun, you can resolve the same type of error more easily by adding --mpi=pmi2 to the invocation, e.g., srun -n 3 --mpi=pmi2 ./mw_fpga

  • In an interactive SLURM job, running your application under mpirun / mpiexec may simply hang (doing nothing indefinitely). To fix this, set I_MPI_HYDRA_BOOTSTRAP=ssh.
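
    A minimal sketch for an interactive session (the application name my_mpi_app is hypothetical):

```shell
# Tell Intel MPI's Hydra process manager to bootstrap via ssh instead of
# its default mechanism, which can hang in interactive SLURM jobs.
export I_MPI_HYDRA_BOOTSTRAP=ssh

# Then launch under mpirun as before, e.g.:
# mpirun -np 4 ./my_mpi_app
```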