Basic Tutorial
Step 1: Get an interactive session on a GPU node
srun -A <YOUR_PROJECT> -N 1 -n 128 --exclusive --gres=gpu:a100:4 -p gpu -t 1:00:00 --pty bash
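If you prefer to submit a batch job instead of working interactively, the same resources can be requested in a job script. The sketch below simply mirrors the srun flags above; <YOUR_PROJECT> is a placeholder as before and myscript.jl stands for whatever Julia script you want to run.

#!/bin/bash
#SBATCH -A <YOUR_PROJECT>
#SBATCH -N 1
#SBATCH -n 128
#SBATCH --exclusive
#SBATCH --gres=gpu:a100:4
#SBATCH -p gpu
#SBATCH -t 1:00:00

# load the Julia module stack (see Step 2) and run a script
ml lang
ml JuliaHPC
julia --project=. myscript.jl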
Step 2: Load the JuliaHPC module
ml lang
ml JuliaHPC
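To verify that the module stack is loaded, you can list the loaded modules and check which Julia you get; the exact module versions will vary, so treat this only as a sanity check.

ml               # without arguments: list the currently loaded modules (should include JuliaHPC)
which julia      # should point to the Julia provided by the module
julia --version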
Step 3: Install CUDA.jl in a local Julia project environment
mkdir jlcuda
cd jlcuda
julia --project=.
Once the Julia REPL pops up:
] add CUDA   # wait until it finishes
# switch back to the julia> prompt by hitting backspace
using CUDA
CUDA.versioninfo()
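If you would rather script this step instead of typing into the package REPL mode, the same installation can be done through the Pkg API; this is just a non-interactive sketch of the commands above.

using Pkg
Pkg.activate(".")      # activate the local project environment (same as julia --project=.)
Pkg.add("CUDA")        # same as `] add CUDA`
using CUDA
CUDA.versioninfo()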
As of April 2022, you should see an output like this:
julia> CUDA.versioninfo()
CUDA toolkit 11.6, local installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.6

Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false

4 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
Note that CUDA.jl automatically uses the local CUDA installation (because the JuliaHPC module exports the necessary environment variables) and that all four NVIDIA A100 GPUs are detected.
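Since all four GPUs are visible to the Julia session, you may want to pin a computation to a specific device. A minimal sketch using CUDA.jl's device API (the indices correspond to the listing above):

using CUDA

collect(devices())   # list the four detected A100s
device!(1)           # direct subsequent allocations and kernels to device 1
device()             # query the currently active device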
Step 4: Run a matrix multiplication on one of the NVIDIA A100 GPUs
A = rand(1000,1000);
B = rand(1000,1000);

@time A*B;       # 0.566672 seconds (2.51 M allocations: 132.536 MiB, 4.50% gc time, 92.09% compilation time)
@time A*B;       # 0.040360 seconds (2 allocations: 7.629 MiB)

Agpu = CuArray(A);   # move matrix to the GPU
Bgpu = CuArray(B);   # move matrix to the GPU

@time Agpu*Bgpu;     # 5.059131 seconds (1.32 M allocations: 70.055 MiB, 0.36% gc time, 12.74% compilation time)
@time Agpu*Bgpu;     # 0.000267 seconds (32 allocations: 640 bytes)
Notice that, once the code has been compiled (first call), the multiplication is much faster on the GPU than on the CPU.
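Keep in mind that GPU operations run asynchronously, so a plain @time may return before the multiplication has actually finished. A sketch of a more faithful measurement, continuing with the Agpu and Bgpu arrays from above, is to synchronize before stopping the timer:

@time CUDA.@sync Agpu*Bgpu;   # waits for the GPU to finish before reporting the time

# for statistically robust numbers, BenchmarkTools.jl can be used as well:
# using BenchmarkTools
# @btime CUDA.@sync $Agpu*$Bgpu;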
CUDA-aware OpenMPI
Allows you to send GPU arrays (i.e. CuArrays from CUDA.jl) via point-to-point and collective MPI operations.
Example
Proceed as in the basic CUDA tutorial above but also run ] add MPI, i.e. install MPI.jl next to CUDA.jl. After using MPI you can call MPI.has_cuda() to check whether the MPI library in use has been compiled with CUDA support (always the case for the OpenMPI provided by the JuliaHPC module).
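For example, a quick check could look like this (the assertion should hold for the OpenMPI behind the JuliaHPC module):

using MPI
MPI.Init()
@assert MPI.has_cuda()   # true if the underlying MPI was built with CUDA support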
You should now be able to run the following code (if stored in a file cuda_mpi_test.jl) from the shell via mpirun -n 5 julia --project cuda_mpi_test.jl.
Code
# cuda_mpi_test.jl
using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

# allocate memory on the GPU
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))

# pass GPU buffers (CuArrays) into MPI functions
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)

println("recv_mesg on proc $rank: $recv_mesg")
Output
rank=4, size=5, dst=0, src=3
rank=1, size=5, dst=2, src=0
rank=2, size=5, dst=3, src=1
rank=0, size=5, dst=1, src=4
rank=3, size=5, dst=4, src=2
recv_mesg on proc 2: [1.0, 1.0, 1.0, 1.0]
recv_mesg on proc 4: [3.0, 3.0, 3.0, 3.0]
recv_mesg on proc 3: [2.0, 2.0, 2.0, 2.0]
recv_mesg on proc 1: [0.0, 0.0, 0.0, 0.0]
recv_mesg on proc 0: [4.0, 4.0, 4.0, 4.0]
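The example above only exercises point-to-point communication; collective operations accept GPU buffers in the same way. The following is a hypothetical follow-up sketch (not part of the original example) that sums a CuArray across all ranks with MPI.Allreduce!; with mpirun -n 5, every element of the result is 0+1+2+3+4 = 10.0.

# cuda_allreduce_test.jl -- hypothetical collective example
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# each rank contributes a GPU buffer filled with its own rank number
N = 4
send_buf = CuArray{Float64}(undef, N)
recv_buf = CuArray{Float64}(undef, N)
fill!(send_buf, Float64(rank))

# element-wise sum across all ranks, operating directly on GPU buffers
MPI.Allreduce!(send_buf, recv_buf, MPI.SUM, comm)

println("allreduce result on proc $rank: $recv_buf")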
Useful References
An Introduction to CUDA-Aware MPI (NVIDIA blog post)
MPI.jl documentation: CUDA-aware MPI support (https://juliaparallel.org/MPI.jl/latest/usage/#CUDA-aware-MPI-support)
MPI.jl documentation: known issues with CUDA-aware MPI (https://juliaparallel.org/MPI.jl/latest/knownissues/#CUDA-aware-MPI)
Source of CUDA-aware MPI code example