Basic Tutorial
Step 1: Get an interactive session on a GPU node
srun -A <YOUR_PROJECT> -N 1 -n 128 --exclusive --gres=gpu:a100:4 -p gpu -t 1:00:00 --pty bash
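If you prefer to submit a batch job instead of working interactively, the same resources can be requested in a job script. The sketch below simply mirrors the srun flags above; <YOUR_PROJECT> is a placeholder as before and myscript.jl stands for whatever Julia script you want to run.

#!/bin/bash
#SBATCH -A <YOUR_PROJECT>
#SBATCH -N 1
#SBATCH -n 128
#SBATCH --exclusive
#SBATCH --gres=gpu:a100:4
#SBATCH -p gpu
#SBATCH -t 1:00:00

# load the Julia module stack (see Step 2) and run a script
ml lang
ml JuliaHPC
julia --project=. myscript.jl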
Step 2: Load the JuliaHPC module
ml lang
ml JuliaHPC
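To verify that the module stack is loaded, you can list the loaded modules and check which Julia you get; the exact module versions will vary, so treat this only as a sanity check.

ml               # without arguments: list the currently loaded modules (should include JuliaHPC)
which julia      # should point to the Julia provided by the module
julia --version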
Step 3: Install CUDA.jl in a local Julia project environment
mkdir jlcuda
cd jlcuda
julia --project=.
Once the Julia REPL pops up:
] add CUDA   # wait until it finishes
# switch back to the julia> prompt by hitting backspace
using CUDA
CUDA.versioninfo()
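If you would rather script this step instead of typing into the package REPL mode, the same installation can be done through the Pkg API; this is just a non-interactive sketch of the commands above.

using Pkg
Pkg.activate(".")      # activate the local project environment (same as julia --project=.)
Pkg.add("CUDA")        # same as `] add CUDA`
using CUDA
CUDA.versioninfo()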
As of April 2022, you should see an output like this:
julia> CUDA.versioninfo()
CUDA toolkit 11.6, local installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.6

Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false

4 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
Note that CUDA.jl automatically uses the local CUDA installation (because the JuliaHPC module exports the necessary environment variables) and that all four NVIDIA A100 GPUs are detected.
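Since all four GPUs are visible to the Julia session, you may want to pin a computation to a specific device. A minimal sketch using CUDA.jl's device API (the indices correspond to the listing above):

using CUDA

collect(devices())   # list the four detected A100s
device!(1)           # direct subsequent allocations and kernels to device 1
device()             # query the currently active device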
Step 4: Run a matrix multiplication on one of the NVIDIA A100 GPUs
A = rand(1000,1000);
B = rand(1000,1000);

@time A*B;       # 0.566672 seconds (2.51 M allocations: 132.536 MiB, 4.50% gc time, 92.09% compilation time)
@time A*B;       # 0.040360 seconds (2 allocations: 7.629 MiB)

Agpu = CuArray(A);   # move matrix to the GPU
Bgpu = CuArray(B);   # move matrix to the GPU

@time Agpu*Bgpu;     # 5.059131 seconds (1.32 M allocations: 70.055 MiB, 0.36% gc time, 12.74% compilation time)
@time Agpu*Bgpu;     # 0.000267 seconds (32 allocations: 640 bytes)
Notice that, once the code has been compiled (first call), the multiplication is much faster on the GPU than on the CPU.
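Keep in mind that GPU operations run asynchronously, so a plain @time may return before the multiplication has actually finished. A sketch of a more faithful measurement, continuing with the Agpu and Bgpu arrays from above, is to synchronize before stopping the timer:

@time CUDA.@sync Agpu*Bgpu;   # waits for the GPU to finish before reporting the time

# for statistically robust numbers, BenchmarkTools.jl can be used as well:
# using BenchmarkTools
# @btime CUDA.@sync $Agpu*$Bgpu;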
CUDA-aware OpenMPI
Allows you to send GPU arrays (i.e. CuArrays from CUDA.jl) via point-to-point and collective MPI operations.
Example
Proceed as in the basic CUDA tutorial above but also run ] add MPI, i.e. install MPI.jl next to CUDA.jl. After using MPI you can call MPI.has_cuda() to check whether the MPI library in use has been compiled with CUDA support (always the case for the OpenMPI provided by the JuliaHPC module).
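For example, a quick check could look like this (the assertion should hold for the OpenMPI behind the JuliaHPC module):

using MPI
MPI.Init()
@assert MPI.has_cuda()   # true if the underlying MPI was built with CUDA support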
You should now be able to run the following code (if stored in a file cuda_mpi_test.jl) from the shell via mpirun -n 5 julia --project cuda_mpi_test.jl.
Code
# cuda_mpi_test.jl
using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

# allocate memory on the GPU
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))

# pass GPU buffers (CuArrays) into MPI functions
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)

println("recv_mesg on proc $rank: $recv_mesg")
Output
rank=4, size=5, dst=0, src=3
rank=1, size=5, dst=2, src=0
rank=2, size=5, dst=3, src=1
rank=0, size=5, dst=1, src=4
rank=3, size=5, dst=4, src=2
recv_mesg on proc 2: [1.0, 1.0, 1.0, 1.0]
recv_mesg on proc 4: [3.0, 3.0, 3.0, 3.0]
recv_mesg on proc 3: [2.0, 2.0, 2.0, 2.0]
recv_mesg on proc 1: [0.0, 0.0, 0.0, 0.0]
recv_mesg on proc 0: [4.0, 4.0, 4.0, 4.0]
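The example above only exercises point-to-point communication; collective operations accept GPU buffers in the same way. The following is a hypothetical follow-up sketch (not part of the original example) that sums a CuArray across all ranks with MPI.Allreduce!; with mpirun -n 5, every element of the result is 0+1+2+3+4 = 10.0.

# cuda_allreduce_test.jl -- hypothetical collective example
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# each rank contributes a GPU buffer filled with its own rank number
N = 4
send_buf = CuArray{Float64}(undef, N)
recv_buf = CuArray{Float64}(undef, N)
fill!(send_buf, Float64(rank))

# element-wise sum across all ranks, operating directly on GPU buffers
MPI.Allreduce!(send_buf, recv_buf, MPI.SUM, comm)

println("allreduce result on proc $rank: $recv_buf")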
Useful References
An Introduction to CUDA-Aware MPI (NVIDIA blog post)
MPI.jl documentation: CUDA-aware MPI support (https://juliaparallel.org/MPI.jl/latest/usage/#CUDA-aware-MPI-support)
MPI.jl documentation: known issues with CUDA-aware MPI (https://juliaparallel.org/MPI.jl/latest/knownissues/#CUDA-aware-MPI)
Source of CUDA-aware MPI code example