  • Known issues with Intel MPI versions 2021.6, 2021.7, 2021.8 and 2021.9 on Noctua 2:
    Collective routines may hang on Noctua 2 when using one of these versions, which are included in oneAPI versions from 22.2.0 up to and including 23.1.0. The issue may only show for sufficiently large numbers of ranks. We are currently investigating the issue.

    • Recommended solution: We recommend using either Intel MPI 2021.5 or older (as included in oneAPI 22.1.2 and older) or Intel MPI 2021.10 or newer (as included in oneAPI 23.2.0 and newer). If you are using one of the intel toolchains from our module tree, you can reload the mpi/impi module in one of the non-affected versions:
      module load mpi/impi/2021.5.0-intel-compilers-2022.0.1

    • Alternative workaround (may have performance impact): The problem can also be worked around by setting the FI_PROVIDER environment variable in one of the following ways:

      • FI_PROVIDER=psm3

      • FI_PROVIDER=verbs
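A minimal sketch of how the alternative workaround could be applied in a job script, only when an affected version is in use. The variable IMPI_VERSION, the helper impi_is_affected and the binary ./my_app are hypothetical names chosen for illustration; the version range is the one listed above.

```shell
#!/bin/bash
# Sketch: apply the FI_PROVIDER workaround only for affected Intel MPI versions.
# IMPI_VERSION is a hypothetical variable; set it to the version you have loaded.
IMPI_VERSION="${IMPI_VERSION:-2021.7.1}"

# Hypothetical helper: returns 0 (true) for the 2021.6 to 2021.9 series.
impi_is_affected() {
    case "$1" in
        2021.[6-9]|2021.[6-9].*) return 0 ;;
        *) return 1 ;;
    esac
}

if impi_is_affected "$IMPI_VERSION"; then
    export FI_PROVIDER=psm3   # alternatively: export FI_PROVIDER=verbs
fi
echo "Intel MPI ${IMPI_VERSION}: FI_PROVIDER=${FI_PROVIDER:-<unset>}"

# Launch the application as usual afterwards, e.g.:
# srun ./my_app   # hypothetical application binary
```

Because the variable is exported before the launch line, it is inherited by the MPI processes started from the same script.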

  • Software that brings its own copy of Intel MPI may fail on startup with an error similar to:

    Fatal error in MPI_Init: Other MPI error, error stack:
    MPIR_Init_thread(805): fail failed
    MPID_Init(1743)......: channel initialization failed
    MPID_Init(2137)......: PMI_Init returned -1
    • Reason: Within compute jobs, we automatically set the I_MPI_PMI_LIBRARY environment variable to allow running MPI jobs using srun with Intel MPI. However, codes that bring their own copy of Intel MPI usually use a different startup mechanism (e.g., mpirun) which fails when this variable is set.

    • Solution: Unset the variable in your job script using unset I_MPI_PMI_LIBRARY.

    • If you can launch your software with srun, you can resolve the same type of error more easily by adding --mpi=pmi2 to the invocation, e.g. srun -n 3 --mpi=pmi2 ./mw_fpga
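The following is a sketch of the relevant part of a job script for software that ships its own Intel MPI, assuming it is started with its bundled mpirun. The binary name ./bundled_app and the rank count are hypothetical.

```shell
#!/bin/bash
# Remove the variable that is set automatically inside compute jobs.
unset I_MPI_PMI_LIBRARY

# Confirm that it is really gone before starting the application.
if [ -z "${I_MPI_PMI_LIBRARY+x}" ]; then
    echo "I_MPI_PMI_LIBRARY is unset"
fi

# Start the application with its own launcher afterwards, e.g.:
# mpirun -np 4 ./bundled_app   # hypothetical binary and rank count
```

The unset only affects the current job script, so other jobs that rely on srun with Intel MPI keep the automatically set value.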

  • In an interactive Slurm job, running your application under mpirun or mpiexec may simply hang (doing nothing indefinitely). To fix this, set I_MPI_HYDRA_BOOTSTRAP=ssh.
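As a sketch, the setting can be exported in the interactive shell before the launcher is called. The binary name ./my_app and the rank count are hypothetical.

```shell
#!/bin/bash
# Tell the Hydra launcher behind mpirun/mpiexec to start remote processes via ssh.
export I_MPI_HYDRA_BOOTSTRAP=ssh
echo "I_MPI_HYDRA_BOOTSTRAP=${I_MPI_HYDRA_BOOTSTRAP}"

# Then start the application inside the interactive job, e.g.:
# mpirun -np 4 ./my_app   # hypothetical binary and rank count
```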