SAXPY on GPUs with Julia

What is SAXPY?

SAXPY stands for single-precision (Float32) "a times x plus y", where a is a scalar constant and x and y are vectors.

Basic CPU Implementation

const dim = 100_000_000
const a = 3.1415

# allocate vectors
x = ones(Float32, dim)
y = ones(Float32, dim)
z = zeros(Float32, dim)

# perform SAXPY on CPU
z .= a .* x .+ y
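Equivalently, the broadcast can be spelled out as an explicit loop. A minimal sketch (the function name saxpy_cpu! is illustrative, not part of the original):

function saxpy_cpu!(z, a, x, y)
    # explicit-loop equivalent of z .= a .* x .+ y
    @inbounds for i in eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
    return z
end

saxpy_cpu!(z, a, x, y)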

Three ways to SAXPY on GPU

The high-level way: Broadcasting

using CUDA

const dim = 100_000_000
const a = 3.1415

# allocate vectors on GPU
x = CUDA.ones(Float32, dim)
y = CUDA.ones(Float32, dim)
z = CUDA.zeros(Float32, dim)

# perform SAXPY on GPU
CUDA.@sync z .= a .* x .+ y
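GPU operations are launched asynchronously, which is why CUDA.@sync is used to wait for the broadcast to actually finish, e.g. when timing it. A minimal timing sketch using BenchmarkTools (this @btime invocation is an assumption, not from the original):

using BenchmarkTools
# interpolate the (device) arrays and the scalar to avoid benchmarking global-variable access
@btime CUDA.@sync $z .= $a .* $x .+ $y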

The library way: CUBLAS

using CUDA
using CUDA.CUBLAS

const dim = 100_000_000
const a = 3.1415

# allocate vectors on the GPU
x = CUDA.ones(Float32, dim)
y = CUDA.ones(Float32, dim)

# perform SAXPY on GPU
# (y is overwritten with the result)
CUDA.@sync CUBLAS.axpy!(dim, a, x, y)

The kernel way: Custom CUDA kernel
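CUDA.jl also lets us write the SAXPY kernel ourselves and launch it with the @cuda macro. A minimal sketch (the kernel name and the launch configuration are illustrative choices, not prescribed by CUDA.jl):

function saxpy_kernel!(z, a, x, y)
    # global (1-based) thread index
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(z)
        @inbounds z[i] = a * x[i] + y[i]
    end
    return nothing
end

nthreads = 1024                  # threads per block
nblocks  = cld(dim, nthreads)    # enough blocks to cover all elements

# launch the kernel and wait for it to finish
# (a is converted to Float32 so the kernel does pure single-precision math)
CUDA.@sync @cuda threads=nthreads blocks=nblocks saxpy_kernel!(z, Float32(a), x, y)

The fixed 1024 threads per block is just a reasonable default for this sketch; a tuned or queried launch configuration could be used instead.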

Performance Comparison

Since SAXPY is memory bound, we take the achieved GPU memory bandwidth as a performance proxy. As the measurements below show, all three implementations perform reasonably well, with the CUBLAS and broadcasting variants being practically indistinguishable and close to the theoretical maximum.
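For reference, SAXPY touches three Float32 arrays per element (two reads, one write), so the achieved bandwidth can be estimated from a measured runtime. A minimal sketch (variable names are illustrative):

t = CUDA.@elapsed z .= a .* x .+ y    # GPU runtime in seconds
bytes = 3 * dim * sizeof(Float32)     # 2 reads + 1 write per element
bandwidth_GB_per_s = bytes / t / 1e9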

[Plots: achieved memory bandwidth of the three SAXPY variants on an NVIDIA V100 and an NVIDIA A100]

Slides for HPC.NRW
