Using NVIDIA GPUs with Julia


Basic Tutorial

Step 1: Get an interactive session on a GPU node

srun -A <YOUR_PROJECT> -N 1 -n 128 --exclusive --gres=gpu:a100:4 -p gpu -t 1:00:00 --pty bash

Step 2: Load the JuliaHPC module

ml lang
ml JuliaHPC

Step 3: Install CUDA.jl in a local Julia project environment

mkdir jlcuda
cd jlcuda
julia --project=.

Once the Julia REPL pops up:

] add CUDA  # wait until it finishes
# switch back to the julia> prompt by hitting backspace
using CUDA
CUDA.versioninfo()

As of April 2022, you should see an output like this:

julia> CUDA.versioninfo()
CUDA toolkit 11.6, local installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.6

Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false

4 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.405 GiB / 40.000 GiB available)

Note that CUDA.jl automatically uses the local CUDA installation (because the JuliaHPC module exports the necessary environment variables) and that all four NVIDIA A100 GPUs are detected.
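You can also verify the setup programmatically. The following sketch uses standard CUDA.jl API calls (CUDA.functional, CUDA.devices, CUDA.device!) to confirm that a working CUDA installation was found, list the visible devices, and select one explicitly:

```julia
using CUDA

# check that a working CUDA setup was found before running GPU code
@assert CUDA.functional()

# list all visible devices
for dev in CUDA.devices()
    println(dev)
end

# make device 0 the active GPU for this task
CUDA.device!(0)
```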

Step 4: Run a matrix multiplication on one of the NVIDIA A100 GPUs

A = rand(1000,1000);
B = rand(1000,1000);
@time A*B;        # 0.566672 seconds (2.51 M allocations: 132.536 MiB, 4.50% gc time, 92.09% compilation time)
@time A*B;        # 0.040360 seconds (2 allocations: 7.629 MiB)
Agpu = CuArray(A);  # move matrix to GPU
Bgpu = CuArray(B);  # move matrix to GPU
@time Agpu*Bgpu;  # 5.059131 seconds (1.32 M allocations: 70.055 MiB, 0.36% gc time, 12.74% compilation time)
@time Agpu*Bgpu;  # 0.000267 seconds (32 allocations: 640 bytes)

Notice that, once compilation is out of the way, the multiplication is much faster on the GPU (about 0.27 ms vs. 40 ms on the CPU in this run). The first call in each case is dominated by compilation time.
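One caveat: GPU operations in CUDA.jl are launched asynchronously, so a plain @time may return before the multiplication has actually finished on the device. For reliable timings, synchronize explicitly; a minimal sketch:

```julia
using CUDA

A = rand(1000, 1000); B = rand(1000, 1000)
Agpu = CuArray(A); Bgpu = CuArray(B)

# CUDA.@sync blocks until all queued GPU work is done,
# so the measured time covers the whole multiplication
@time CUDA.@sync Agpu * Bgpu;

# CUDA.@elapsed measures the GPU time directly via CUDA events
t = CUDA.@elapsed Agpu * Bgpu
println("elapsed GPU time: $t s")
```

For serious benchmarking, the BenchmarkTools.jl package (@btime combined with CUDA.@sync) gives more robust numbers than a single @time call.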

CUDA-aware OpenMPI

Allows you to send GPU arrays (i.e. CuArray from CUDA.jl) via point-to-point and collective MPI operations.

Example

Proceed as in the basic CUDA tutorial above, but also run ] add MPI, i.e. install MPI.jl next to CUDA.jl. After using MPI you can call MPI.has_cuda() to check whether the MPI library in use has been compiled with CUDA support. (The easiest way to get a CUDA-aware MPI that works with Julia is to use the JuliaHPC module and to export OMPI_MCA_opal_cuda_support=1.)

You should now be able to run the following code (if stored in a file cuda_mpi_test.jl) from the shell via mpirun -n 5 julia --project cuda_mpi_test.jl.

Code

# cuda_mpi_test.jl
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

# allocate memory on the GPU
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))

# pass GPU buffers (CuArrays) directly into MPI functions
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")

Output

rank=4, size=5, dst=0, src=3
rank=1, size=5, dst=2, src=0
rank=2, size=5, dst=3, src=1
rank=0, size=5, dst=1, src=4
rank=3, size=5, dst=4, src=2
recv_mesg on proc 2: [1.0, 1.0, 1.0, 1.0]
recv_mesg on proc 4: [3.0, 3.0, 3.0, 3.0]
recv_mesg on proc 3: [2.0, 2.0, 2.0, 2.0]
recv_mesg on proc 1: [0.0, 0.0, 0.0, 0.0]
recv_mesg on proc 0: [4.0, 4.0, 4.0, 4.0]
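In the example above, all ranks implicitly use the default GPU (device 0). When several ranks share a node with multiple GPUs, it is common to pin each rank to its own device. A minimal sketch, assuming all ranks run on the same node (round-robin assignment):

```julia
using MPI
using CUDA

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# round-robin: rank 0 -> GPU 0, rank 1 -> GPU 1, ... (wraps around)
ngpus = length(CUDA.devices())
CUDA.device!(rank % ngpus)

println("rank $rank uses $(CUDA.device())")
```

When ranks span multiple nodes, the node-local rank should be used instead of the global rank (e.g. obtained via MPI.Comm_split_type with MPI.COMM_TYPE_SHARED), since CUDA device indices are per node.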
