


Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

- Homepage: https://github.com/uber/horovod

Noctua 2

This table is generated automatically. If you need other versions, please contact pc2-support@uni-paderborn.de.

Multi-GPU Training with Horovod and DeepMD

Installation as a Python virtual environment (one time)

We recommend installing Python virtual environments in your project's group directory on permanent storage (/pc2/groups/[project name]) rather than in your home directory or on the parallel file system (/scratch/[project name]). Replace [project name] with the name of your compute time project, e.g. hpc-prf-…

    module reset
    module load toolchain/foss/2022a
    module load lang/Python/3.10.4-GCCcore-11.3.0
    module load system/CUDA/12.6.0

    # Create a fresh virtual environment in the project's group directory
    rm -rf /pc2/groups/[project name]/venv_deepmd_horovod
    python3 -m venv /pc2/groups/[project name]/venv_deepmd_horovod
    source /pc2/groups/[project name]/venv_deepmd_horovod/bin/activate

    pip install cmake
    pip install mpi4py
    pip install tensorflow
    pip install deepmd-kit[gpu,cu12]

    # Build Horovod with TensorFlow and MPI support
    export HOROVOD_WITH_TENSORFLOW=1
    export HOROVOD_WITH_MPI=1
    pip install horovod[tensorflow]
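After the installation it can be useful to confirm that the expected command-line tools are visible in the activated environment. The following is a minimal sketch, not part of the official instructions: `check_tools` is a hypothetical helper name, and it only tests whether a command is on the `PATH`, not whether it works.

```shell
# Hypothetical helper: report which of the given commands are on the PATH.
# Returns 0 if all were found, 1 if at least one is missing.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

With the virtual environment activated, `check_tools dp horovodrun` should report both tools as found. If it does, `horovodrun --check-build` lists the frameworks and controllers Horovod was built with, so you can check that TensorFlow and MPI are marked as available.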

Usage in a compute job

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -t 2:00:00
    #SBATCH --ntasks-per-node=2
    #SBATCH --cpus-per-task=32
    #SBATCH --gpus-per-task=1
    #SBATCH --gres=gpu:a100:2
    #SBATCH -p gpu
    # Alternatively, for testing in small/short jobs, see
    # https://upb-pc2.atlassian.net/wiki/spaces/PC2DOK/pages/1902952/Running+Compute+Jobs#Using-GPUs-for-Development-and-Testing-Purposes
    ##SBATCH -p dgx
    ##SBATCH -q devel

    # One OpenMP thread per allocated core, pinned to cores
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export OMP_PLACES=cores
    export OMP_PROC_BIND=true

    # Split the cores of each task between TensorFlow's inter-op and intra-op thread pools
    export TF_INTER_OP_PARALLELISM_THREADS=2
    export TF_INTRA_OP_PARALLELISM_THREADS=$((OMP_NUM_THREADS - TF_INTER_OP_PARALLELISM_THREADS))

    module reset
    module load toolchain/foss/2022a
    module load lang/Python/3.10.4-GCCcore-11.3.0
    module load system/CUDA/12.6.0
    source /pc2/groups/[project name]/venv_deepmd_horovod/bin/activate

    # Start one training process per task (one per GPU)
    srun dp train input.json
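The job script derives the intra-op thread count by subtracting the inter-op thread count from the cores available to each task. The sketch below shows that arithmetic on its own. The helper name `tf_intra_threads`, the lower bound of one thread, and the fallback of 32 cores outside a Slurm job are assumptions made for illustration.

```shell
# Hypothetical helper: intra-op threads = cores per task - inter-op threads,
# but never fewer than one thread.
tf_intra_threads() {
  local cpus=$1
  local inter=$2
  local intra=$((cpus - inter))
  if [ "$intra" -lt 1 ]; then
    intra=1
  fi
  echo "$intra"
}

# Outside a Slurm job SLURM_CPUS_PER_TASK is unset; fall back to 32 (assumption).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-32}"
export TF_INTER_OP_PARALLELISM_THREADS=2
export TF_INTRA_OP_PARALLELISM_THREADS="$(tf_intra_threads "$OMP_NUM_THREADS" "$TF_INTER_OP_PARALLELISM_THREADS")"
echo "intra-op threads: $TF_INTRA_OP_PARALLELISM_THREADS"
```

With `--cpus-per-task=32` and two inter-op threads, this gives 30 intra-op threads per task. Shell arithmetic expansion avoids the dependency on `bc`.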