Noctua 2 Huge-Memory Nodes

The huge-memory nodes in Noctua 2 have 2 TB of main memory by default and, in addition, 12x 3 TB fast NVMe SSDs in each node.

Performance

The key performance figures of these nodes are:

  • CPU: 2x 64-core AMD EPYC Milan 7713, 3.7 TFLOP/s in the HPL benchmark

  • Memory: 2 TB DDR4 at 360 GB/s in STREAM triad benchmark

  • SSDs: 12x 3 TB PCIe4 SSDs

    • individual performance (i.e., one SSD):

      • 7.0 GB/s sequential read (128 KiB blocks), 5.0 GB/s sequential write (128 KiB blocks)

      • 1000 kIOPS random read (4 KiB blocks), 800 kIOPS random write (4 KiB blocks)

    • aggregated performance (i.e., the whole node):

      • 84.0 GB/s sequential read (128 KiB blocks), 60.0 GB/s sequential write (128 KiB blocks)

      • 12,000 kIOPS random read (4 KiB blocks), 9,600 kIOPS random write (4 KiB blocks)

Usage

The 2 TB of main memory in a huge-memory node is conventional DDR4 memory and can be used just like the memory of the other nodes. If 2 TB per node is sufficient for your calculation, you can stop reading here.

The SSD storage can be used in several different ways depending on your workload:

Using an Existing IO Mechanism of Your Code

This is often the case, for example, in quantum chemistry, where most codes can write out quantities like electron-repulsion integrals that otherwise would require large main memory.

Depending on your workload, several different operating modes of the SSDs can be configured when submitting a job:

  1. The SSDs can be combined into one large file system.

    • add the option --hugememssd=combined to your job submission

    • then about 36 TB will be available at $LOCAL_SSD_ALL and you can, for example, use this path as the temporary working directory for your application.
      Here is a short example of how this could be used with ORCA (a complete job-script sketch follows after this list):

      cp molecule.inp $LOCAL_SSD_ALL
      cd $LOCAL_SSD_ALL
      orca molecule.inp > $SLURM_SUBMIT_DIR/molecule.out
      cp molecule* $SLURM_SUBMIT_DIR
    • all data on the SSDs will be automatically removed at the end of the job

  2. The SSDs are used as separate storage, where each has its own file system.

    • add the option --hugememssd=individual to your job submission

    • each SSD is then prepared with its own file system and made available via the environment variables
      LOCAL_SSD_01, …, LOCAL_SSD_12

    • If your parallel processes can work in individual working directories, then this mode should be preferred, because treating each SSD as an individual device/file system allows for higher read and write throughput than the combined mode above.

    • We provide a wrapper that automatically assigns the SSDs to the TMPDIR environment variable of each MPI rank individually, in a round-robin fashion.
      This can be seen in the following example:

      #!/bin/bash
      #SBATCH -N 1
      #SBATCH --exclusive
      #SBATCH -n 32
      #SBATCH -p hugemem
      #SBATCH -t 10:00
      #SBATCH --hugememssd=individual
      module load tools/hugemem-ssd-helper/helper
      srun ssd_wrapper bash -c 'echo "Rank: $PMIX_RANK TMPDIR: $TMPDIR"'

      which results in the following output showing that the 32 tasks have their TMPDIR set to different SSDs:

      Rank: 21 TMPDIR: /mnt/nvme_10/rank_21
      Rank: 23 TMPDIR: /mnt/nvme_12/rank_23
      Rank: 5 TMPDIR: /mnt/nvme_06/rank_5
      Rank: 10 TMPDIR: /mnt/nvme_11/rank_10
      Rank: 14 TMPDIR: /mnt/nvme_03/rank_14
      Rank: 26 TMPDIR: /mnt/nvme_03/rank_26
      Rank: 30 TMPDIR: /mnt/nvme_07/rank_30
      Rank: 19 TMPDIR: /mnt/nvme_08/rank_19
      Rank: 28 TMPDIR: /mnt/nvme_05/rank_28
      Rank: 0 TMPDIR: /mnt/nvme_01/rank_0
      Rank: 7 TMPDIR: /mnt/nvme_08/rank_7
      Rank: 27 TMPDIR: /mnt/nvme_04/rank_27
      Rank: 4 TMPDIR: /mnt/nvme_05/rank_4
      Rank: 22 TMPDIR: /mnt/nvme_11/rank_22
      Rank: 25 TMPDIR: /mnt/nvme_02/rank_25
      Rank: 2 TMPDIR: /mnt/nvme_03/rank_2
      Rank: 20 TMPDIR: /mnt/nvme_09/rank_20
      Rank: 6 TMPDIR: /mnt/nvme_07/rank_6
      Rank: 16 TMPDIR: /mnt/nvme_05/rank_16
      Rank: 1 TMPDIR: /mnt/nvme_02/rank_1
      Rank: 29 TMPDIR: /mnt/nvme_06/rank_29
      Rank: 11 TMPDIR: /mnt/nvme_12/rank_11
      Rank: 15 TMPDIR: /mnt/nvme_04/rank_15
      Rank: 18 TMPDIR: /mnt/nvme_07/rank_18
      Rank: 3 TMPDIR: /mnt/nvme_04/rank_3
      Rank: 13 TMPDIR: /mnt/nvme_02/rank_13
      Rank: 8 TMPDIR: /mnt/nvme_09/rank_8
      Rank: 31 TMPDIR: /mnt/nvme_08/rank_31
      Rank: 17 TMPDIR: /mnt/nvme_06/rank_17
      Rank: 9 TMPDIR: /mnt/nvme_10/rank_9
      Rank: 24 TMPDIR: /mnt/nvme_01/rank_24
      Rank: 12 TMPDIR: /mnt/nvme_01/rank_12
    • all data on the SSDs will be automatically removed at the end of the job
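
For the combined mode, these pieces can be put together into a single job script. The following is only a minimal sketch: the ORCA module name, input file, and resource requests are placeholders and should be adapted to your installation and calculation.

  #!/bin/bash
  #SBATCH -N 1
  #SBATCH --exclusive
  #SBATCH -p hugemem
  #SBATCH -t 10:00:00
  #SBATCH --hugememssd=combined

  module load chem/ORCA             # placeholder module name, adapt to your setup

  # use the combined ~36 TB SSD file system as scratch space
  cp molecule.inp $LOCAL_SSD_ALL
  cd $LOCAL_SSD_ALL
  orca molecule.inp > $SLURM_SUBMIT_DIR/molecule.out

  # copy results back before the job ends; the SSDs are wiped afterwards
  cp molecule* $SLURM_SUBMIT_DIR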

Transparently Using the SSDs as an Extension of Main Memory

This mode is useful if your application requires more main memory than is physically available on the huge-memory nodes, i.e., more than 2 TB. In this case, memory allocations via malloc or related calls can be intercepted so that, instead of being placed in physical main memory, they are backed by virtual memory that is mapped to the SSDs. The kernel page cache automatically acts as a buffer between the SSDs and the application.

To facilitate this interception of the memory allocations, we use the library MEMKIND (https://github.com/memkind/memkind), which we have modified to support multiple SSDs via chunking.

Please let us know via pc2-support@uni-paderborn.de if you would like to use this mode, as it requires some testing and configuration that we are happy to take care of.

The basic steps are as follows (a job-script sketch that ties them together is shown after the list):

  1. Request a hugemem node in a job script or interactive job with the sbatch argument --hugememssd=individual. This argument mounts all SSDs as individual devices.

  2. Load the module system/memkind/multidevice-mmap-foss-2022b

  3. Adapt the default parameters set by this module.

    • AUTO_PMEM_SIZE: This is the most important setting, a size filter that determines for which allocations the SSDs are used, e.g., AUTO_PMEM_SIZE=1k:100000000M

      • The first number determines the lower size limit. The second number determines the upper size limit.

      • AUTO_PMEM_SIZE=1k:100000000M means that all allocations with sizes between 1 KB and 100 TB are placed on the SSDs; for all other allocations the main memory is used.

    • AUTO_PMEM_CHUNKSIZE:

      • This setting determines the chunk size in bytes. Extent allocations, i.e., allocations of the underlying heap manager jemalloc, that are larger than this value are split across the available SSDs in chunks of this size.

      • This setting is particularly important if your program performs very large allocations.

    • MEMKIND_DEBUG:

      • A value of 0 disables logging.

      • A value of 1 enables messages for each extent allocation on the SSDs. Especially in the beginning, this is helpful for analyzing the behavior.

    • In addition, we can adapt the parameters of the Linux page cache, which acts as a cache for the allocations on the SSDs.

  4. Do a test run of your application and monitor its behavior by ssh-ing to the hugemem node running your job:

    • htop: nicely shows CPU activity and memory usage

    • iostat -xh 1: shows storage device activity (nvme0n1-nvme11n1 are the relevant devices)
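
To illustrate how these steps fit together, here is a minimal job-script sketch. The application name and the parameter values are placeholders (the chunk size in particular is only an assumed example value), and whether your code needs further configuration for the allocation interception is exactly what we would clarify with you via pc2-support@uni-paderborn.de.

  #!/bin/bash
  #SBATCH -N 1
  #SBATCH --exclusive
  #SBATCH -p hugemem
  #SBATCH -t 02:00:00
  #SBATCH --hugememssd=individual

  # step 2: load the modified MEMKIND module
  module load system/memkind/multidevice-mmap-foss-2022b

  # step 3: adapt the default parameters (placeholder values, see above)
  export AUTO_PMEM_SIZE=1k:100000000M    # allocations between 1 KB and 100 TB go to the SSDs
  export AUTO_PMEM_CHUNKSIZE=1073741824  # assumed example: split large extents into 1 GiB chunks
  export MEMKIND_DEBUG=1                 # log each extent allocation on the SSDs

  # step 4: run the application (placeholder name) and monitor with htop/iostat on the node
  ./my_large_memory_application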