Description

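Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The module listed below is built against TensorFlow.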

More information

- Homepage: https://github.com/uber/horovod

Available Versions of Horovod

Version | Module | Available on
0.28.1-foss-2022a-CUDA-11.7.0-TensorFlow-2.11.0 | tools/Horovod/0.28.1-foss-2022a-CUDA-11.7.0-TensorFlow-2.11.0 | Noctua 2

This table is generated automatically. If you need other versions, please contact pc2-support@uni-paderborn.de.

Usage Hints for Horovod

If you need support in using this software or need example job scripts, please contact pc2-support@uni-paderborn.de.

Multi-GPU Training with Horovod and DeepMD

Installation as a Python virtual environment (one time)

We recommend installing Python virtual environments in the group directory on the permanent storage of your project (/pc2/groups/[project name]) rather than in your home directory or on the parallel file system (/scratch/[project name]). Replace [project name] with the name of your compute time project, i.e. hpc-prf-…

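module reset
module load toolchain/foss/2022a
module load lang/Python/3.10.4-GCCcore-11.3.0
module load system/CUDA/12.6.0
# create a fresh virtual environment in the group directory (this removes an existing one)
rm -rf /pc2/groups/[project name]/venv_deepmd_horovod
python3 -m venv /pc2/groups/[project name]/venv_deepmd_horovod
source /pc2/groups/[project name]/venv_deepmd_horovod/bin/activate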

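pip install cmake
pip install mpi4py
pip install tensorflow
# quote the extras so that the shell does not interpret the brackets
pip install "deepmd-kit[gpu,cu12]"
# build Horovod with TensorFlow and MPI support
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_WITH_MPI=1
pip install "horovod[tensorflow]"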
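
After the installation it may be worth checking that Horovod was actually built with the requested features. The following is a minimal sketch, assuming the virtual environment from above is activated; `horovodrun --check-build` lists the frameworks, controllers, and tensor operations that were compiled in. The helper name `check_horovod_build` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: report which features the installed Horovod was built with.
check_horovod_build() {
    if ! command -v horovodrun >/dev/null 2>&1; then
        echo "horovodrun not found: is the virtual environment activated?"
        return 1
    fi
    # Entries that were compiled in are marked with [X] in the output.
    horovodrun --check-build
}

# Do not abort the shell session if the check fails.
check_horovod_build || true
```

If TensorFlow or MPI is not marked as available in the output, the `pip install` of Horovod probably has to be repeated with the `HOROVOD_WITH_...` variables set.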

Usage in a compute job

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 2:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:a100:2
#SBATCH -p gpu

# Alternatively, for testing in small/short jobs, see https://upb-pc2.atlassian.net/wiki/spaces/PC2DOK/pages/1902952/Running+Compute+Jobs#Using-GPUs-for-Development-and-Testing-Purposes
##SBATCH -p dgx
##SBATCH -q devel

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=true

module reset
module load toolchain/foss/2022a
module load lang/Python/3.10.4-GCCcore-11.3.0
module load system/CUDA/12.6.0

source /pc2/groups/[project name]/venv_deepmd_horovod/bin/activate
# start one Horovod process per requested GPU (2 GPUs are requested via --gres above)
srun horovodrun -np 2 dp train input.json
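
The number of Horovod processes passed to `-np` should match the number of GPUs requested with `--gres`. Instead of hard-coding it, the value could be derived inside the job. The following is a sketch under the assumption that Slurm exports `CUDA_VISIBLE_DEVICES` as a comma-separated list for the allocated GPUs; the helper `count_gpus` and the variable `NUM_GPUS` are hypothetical names.

```shell
# Hypothetical helper: count the entries of a comma-separated device list such as "0,1".
count_gpus() {
    local devices="$1"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Split the list on commas and count the resulting fields.
    local IFS=','
    set -- $devices
    echo "$#"
}

# Assumption: CUDA_VISIBLE_DEVICES is set by Slurm for the GPUs granted via --gres.
NUM_GPUS=$(count_gpus "${CUDA_VISIBLE_DEVICES:-}")
echo "Detected ${NUM_GPUS} GPU(s); would run: horovodrun -np ${NUM_GPUS} dp train input.json"
```

In a job script, the final `echo` would be replaced by the actual `horovodrun` call, so that changing the `--gres` line is enough to scale the training to a different number of GPUs on the node.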
