OpenMPI issues
Noctua 1
When using ls1 Mardyn compiled with GCC 11.2.0 and OpenMPI 4.1.1, a memory leak can occur in a few MPI ranks that eventually kills the simulation because it runs out of memory. Other combinations of GCC and OpenMPI might be affected, too.
A suitable workaround is to use the Intel compilers and Intel MPI.
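A minimal sketch of such a build, assuming a CMake-based build of ls1 Mardyn and that the Intel compilers and Intel MPI are provided via the module system (the module names below are hypothetical; check module avail on Noctua 1):

  # Hypothetical module names; adjust to the output of "module avail"
  module load intel impi
  # Use the Intel MPI compiler wrappers for the build
  CC=mpiicc CXX=mpiicpc cmake -B build .
  cmake --build build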
Noctua 2
Errors or warnings when using /tmp (for example, from OpenMPI):
Background: To guarantee sufficient isolation between users, jobs, and nodes, the /tmp directory is redirected to an isolated directory on the parallel file system that only exists for the duration of the compute job. Some programs expect /tmp to reside in main memory and might issue a warning or an error if it does not.
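To verify this inside a compute job, you can check which filesystem backs /tmp, for example:

  # Run inside a job allocation; findmnt shows the mount backing /tmp
  srun findmnt /tmp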
Workarounds:
For OpenMPI:
If you compile OpenMPI yourself and run programs with mpirun, please include "orte_tmpdir_base = /dev/shm" in your openmpi-mca-params.conf (see the sketch after this list).
If you compile OpenMPI yourself and run programs with srun, please make sure that you have compiled OpenMPI with PMIx support. We have set SLURM_PMIX_TMPDIR="/dev/shm" globally, which will then take effect.
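A sketch of both variants, assuming a self-compiled OpenMPI installed under $PREFIX (the PMIx installation path is a placeholder):

  # mpirun: append the MCA parameter to the parameter file of your installation
  echo "orte_tmpdir_base = /dev/shm" >> $PREFIX/etc/openmpi-mca-params.conf

  # srun: configure OpenMPI with PMIx support before building
  ./configure --prefix=$PREFIX --with-pmix=/path/to/pmix
  make -j && make install
  # Verify that a PMIx component is listed
  $PREFIX/bin/ompi_info | grep -i pmix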
OpenMPI on the gpu partition:
Occasionally, MPI jobs on the gpu partition may fail with UCX errors such as the following (we are investigating the issue):
[1648816196.947405] [n2gpu1201:476591:0] mm_posix.c:206 UCX ERROR open(file_name=/proc/476593/fd/44 flags=0x0) failed: No such file or directory
[1648816196.947422] [n2gpu1201:476591:0] mm_ep.c:158 UCX ERROR mm ep failed to connect to remote FIFO id 0xc000000b000745b1: Shared memory error
...
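Until the cause is found, one thing to experiment with is the UCX posix shared-memory transport that these messages point to: it exchanges file descriptors via /proc, and UCX offers a parameter to disable that mechanism. Whether this resolves the failures on the gpu partition is an assumption, not a confirmed fix:

  # Experiment: disable the /proc-based file descriptor exchange of UCX's posix transport
  export UCX_POSIX_USE_PROC_LINK=n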
Some applications might not flush their output to stdout/stderr. In these cases, please use the srun option -u or --unbuffered in your job scripts.
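A minimal job script sketch using the unbuffered option (partition name and executable are placeholders):

  #!/bin/bash
  #SBATCH --ntasks=4
  #SBATCH --partition=gpu        # placeholder; use the partition you need
  # -u/--unbuffered makes srun forward task output without buffering
  srun --unbuffered ./my_application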