OpenMPI issues

Noctua 1

  • When using ls1 Mardyn compiled with GCC 11.2.0 and OpenMPI 4.1.1, a memory leak can occur in a few MPI ranks that will eventually kill the simulation because it runs out of memory. Other combinations of GCC and OpenMPI might be affected, too.

    • A suitable workaround is to use the Intel compilers and Intel MPI.
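
      A minimal job-script sketch of this workaround (the module names are assumptions; check module avail on Noctua 1 for the exact names and versions provided there):

        #!/bin/bash
        #SBATCH --nodes=2
        #SBATCH --ntasks-per-node=32
        #SBATCH --time=01:00:00

        # Use the Intel toolchain instead of GCC + OpenMPI
        # (hypothetical module names; adjust to the installed versions).
        module load intel
        module load impi

        # ls1 Mardyn has to be rebuilt with the Intel compilers first,
        # e.g. with CC=mpiicc CXX=mpiicpc; then launch the rebuilt binary:
        srun ./MarDyn config.xml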

Noctua 2

  • Errors/warnings when using /tmp, for example by OpenMPI:

    • Background: To guarantee sufficient isolation between users, jobs, and nodes, the /tmp directory is redirected to an isolated directory on the parallel file system that only exists for the duration of the compute job. Some programs expect /tmp to reside in main memory and might issue a warning or an error if it does not.

    • Workaround:

      • for OpenMPI:

        • If you compile OpenMPI yourself and run programs with mpirun, please include "orte_tmpdir_base = /dev/shm" in your openmpi-mca-params.conf (see the sketch after this list).

        • If you compile OpenMPI yourself and run programs with srun, please make sure that you have compiled OpenMPI with PMIx support. We have set SLURM_PMIX_TMPDIR="/dev/shm" globally, which then takes effect automatically.
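
        A minimal sketch of the mpirun variant (the file locations follow standard OpenMPI conventions; the install prefix is an assumption):

          # Per-user file read by any OpenMPI you run: $HOME/.openmpi/mca-params.conf
          # System-wide alternative: <openmpi-prefix>/etc/openmpi-mca-params.conf
          # Redirect OpenMPI's session/temporary files to memory-backed /dev/shm:
          orte_tmpdir_base = /dev/shm

        Whether the parameter is picked up can be checked, for example, with ompi_info -a | grep tmpdir.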

  • OpenMPI on the gpu partition:

    • Occasionally, MPI jobs on the gpu partition may fail with UCX errors such as the following (we are investigating the issue):

      [1648816196.947405] [n2gpu1201:476591:0] mm_posix.c:206 UCX ERROR open(file_name=/proc/476593/fd/44 flags=0x0) failed: No such file or directory
      [1648816196.947422] [n2gpu1201:476591:0] mm_ep.c:158 UCX ERROR mm ep failed to connect to remote FIFO id 0xc000000b000745b1: Shared memory error
      ...
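
      The messages come from UCX's POSIX shared-memory transport (mm_posix.c/mm_ep.c). While the issue is under investigation, a possible mitigation, offered here only as an assumption and not as an official fix, is to exclude that transport so UCX falls back to its other transports:

        # Job-script excerpt: run without UCX's posix shared-memory transport.
        # "^posix" means "all available transports except posix"; intra-node
        # communication then uses other transports and may be slower.
        export UCX_TLS=^posix
        srun ./my_mpi_app   # ./my_mpi_app is a placeholder for your application
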
  • Some applications might not flush their output to stdout/stderr. In these cases, please use the srun option -u or --unbuffered in your job scripts.
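
    A minimal job-script sketch (the application name is a placeholder):

      #!/bin/bash
      #SBATCH --ntasks=4
      #SBATCH --time=00:10:00

      # --unbuffered (-u) makes srun forward the application's output
      # immediately instead of buffering it, so the Slurm log fills in real time.
      srun --unbuffered ./my_app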