Using NVIDIA GPUs with Julia
Basic Tutorial
Step 1: Get an interactive session on a GPU node
srun -A <YOUR_PROJECT> -N 1 -n 128 --exclusive --gres=gpu:a100:4 -p gpu -t 1:00:00 --pty bash
Step 2: Load the JuliaHPC module
ml lang
ml JuliaHPC
Step 3: Install CUDA.jl in a local Julia project environment
mkdir jlcuda
cd jlcuda
julia --project=.
Once the Julia REPL pops up:
] add CUDA
# wait until it finishes
# switch back to julia> prompt by hitting backspace
using CUDA
CUDA.versioninfo()
As of April 2022, you should see an output like this:
julia> CUDA.versioninfo()
CUDA toolkit 11.6, local installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.6
Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: missing
- CUTENSOR: missing
Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false
4 devices:
0: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
1: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
2: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
3: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
Note that CUDA.jl automatically uses the local CUDA installation (because the JuliaHPC module exports the necessary environment variables) and that all four NVIDIA A100 GPUs are detected.
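By default, computations run on the first device. If you want to use one of the other GPUs, you can switch devices explicitly; a minimal sketch using CUDA.jl's device API:
using CUDA
collect(devices())   # list the four detected A100 GPUs
device!(1)           # make GPU 1 (0-based ordinal) the active device for this task
device()             # query the currently active device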
Step 4: Run a matrix multiplication on one of the NVIDIA A100 GPUs
A = rand(1000,1000);
B = rand(1000,1000);
@time A*B; # 0.566672 seconds (2.51 M allocations: 132.536 MiB, 4.50% gc time, 92.09% compilation time)
@time A*B; # 0.040360 seconds (2 allocations: 7.629 MiB)
Agpu = CuArray(A); # move matrix to gpu
Bgpu = CuArray(B); # move matrix to gpu
@time Agpu*Bgpu; # 5.059131 seconds (1.32 M allocations: 70.055 MiB, 0.36% gc time, 12.74% compilation time)
@time Agpu*Bgpu; # 0.000267 seconds (32 allocations: 640 bytes)
Notice that the multiplication is much faster on the GPU: after the initial compilation, about two orders of magnitude (0.000267 s vs. 0.040360 s).
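One caveat: GPU operations in CUDA.jl execute asynchronously, so a plain @time largely measures the kernel launch rather than the multiplication itself. To time the actual computation, synchronize before the timer stops; a minimal sketch:
CUDA.@time Agpu*Bgpu;          # CUDA.@time reports CPU and GPU allocations
@time CUDA.@sync Agpu*Bgpu;    # CUDA.@sync blocks until the kernel has finished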
CUDA-aware OpenMPI
CUDA-aware MPI allows you to send GPU arrays (i.e., CuArrays from CUDA.jl) via point-to-point and collective MPI operations.
Example
Proceed as in the basic CUDA tutorial above, but also run ] add MPI, i.e. install MPI.jl next to CUDA.jl. After using MPI, you can call MPI.has_cuda() to check whether the MPI library in use has been compiled with CUDA support. (The easiest way to get a CUDA-aware MPI that works with Julia is to use the JuliaHPC module and to export OMPI_MCA_opal_cuda_support=1.)
You should now be able to run the following code (if stored in a file cuda_mpi_test.jl) from the shell via mpirun -n 5 julia --project cuda_mpi_test.jl.
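Putting it together, a typical shell session (assuming the JuliaHPC module from the basic tutorial) might look like:
ml lang
ml JuliaHPC
export OMPI_MCA_opal_cuda_support=1
mpirun -n 5 julia --project cuda_mpi_test.jl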
Code
# cuda_mpi_test.jl
using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
# allocate memory on the GPU
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
# pass GPU buffers (CuArrays) into MPI functions
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
Output
rank=4, size=5, dst=0, src=3
rank=1, size=5, dst=2, src=0
rank=2, size=5, dst=3, src=1
rank=0, size=5, dst=1, src=4
rank=3, size=5, dst=4, src=2
recv_mesg on proc 2: [1.0, 1.0, 1.0, 1.0]
recv_mesg on proc 4: [3.0, 3.0, 3.0, 3.0]
recv_mesg on proc 3: [2.0, 2.0, 2.0, 2.0]
recv_mesg on proc 1: [0.0, 0.0, 0.0, 0.0]
recv_mesg on proc 0: [4.0, 4.0, 4.0, 4.0]
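The example above exercises only point-to-point communication; collective operations accept CuArrays in the same way. A minimal sketch (the file name cuda_mpi_allreduce.jl is hypothetical), summing each rank's GPU buffer across all ranks:
# cuda_mpi_allreduce.jl (hypothetical companion example)
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

send_mesg = CUDA.fill(Float64(rank), 4)         # GPU buffer holding this rank's id
recv_mesg = MPI.Allreduce(send_mesg, +, comm)   # element-wise sum over all ranks
println("sum on proc $rank: $recv_mesg")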
Useful References
An Introduction to CUDA-Aware MPI (NVIDIA blog post)
https://juliaparallel.org/MPI.jl/latest/usage/#CUDA-aware-MPI-support and https://juliaparallel.org/MPI.jl/latest/knownissues/#CUDA-aware-MPI
Source of CUDA-aware MPI code example