SAXPY on GPUs with Julia
What is SAXPY?
SAXPY stands for single-precision (Float32) "a times x plus y", where a is a constant and x and y are vectors.
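Element-wise, the operation computes z[i] = a * x[i] + y[i]; the classic in-place BLAS form overwrites y instead, i.e. y[i] = a * x[i] + y[i].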
Basic CPU Implementation
const dim = 100_000_000
const a = 3.1415f0  # Float32 literal so the computation stays in single precision
# allocate vectors
x = ones(Float32, dim)
y = ones(Float32, dim)
z = zeros(Float32, dim)
# perform SAXPY on CPU
z .= a .* x .+ y
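To get a CPU baseline for the comparison further below, the broadcast can be timed, for instance with BenchmarkTools (this timing snippet is a sketch and not part of the listing above):
using BenchmarkTools
# interpolate the global arrays so only the broadcast itself is measured
@btime $z .= a .* $x .+ $y;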
Three ways to SAXPY on the GPU
The high-level way: Broadcasting
using CUDA
const dim = 100_000_000
const a = 3.1415f0
# allocate vectors on GPU
x = CUDA.ones(Float32, dim)
y = CUDA.ones(Float32, dim)
z = CUDA.zeros(Float32, dim)
# perform SAXPY on GPU
CUDA.@sync z .= a .* x .+ y
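If the data already lives in regular CPU arrays, it can be copied to the device explicitly; the following lines are a sketch (not part of the original snippet) using the CuArray and Array constructors for host-to-device and device-to-host transfers:
x_cpu = ones(Float32, dim)
x_gpu = CuArray(x_cpu)  # copy host -> device
z_cpu = Array(z)        # copy device -> host, e.g. to inspect the result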
The library way: CUBLAS
using CUDA
using CUDA.CUBLAS
const dim = 100_000_000
const a = 3.1415f0
# allocate vectors on the GPU
x = CUDA.ones(Float32, dim)
y = CUDA.ones(Float32, dim)
# perform SAXPY on GPU
# (y is overwritten with the result)
CUDA.@sync CUBLAS.axpy!(dim, a, x, y)
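As a side note (not shown in the original), the generic LinearAlgebra.axpy! should dispatch to the same CUBLAS routine when called with CuArrays, so the call above can also be written as:
using LinearAlgebra
# y is overwritten with a*x + y; for CuArrays this hits CUBLAS under the hood
CUDA.@sync axpy!(a, x, y)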
The kernel way: Custom CUDA kernel
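A minimal sketch of what such a hand-written kernel could look like with CUDA.jl is shown below (the launch configuration of 1024 threads per block is an illustrative choice):
using CUDA
const dim = 100_000_000
const a = 3.1415f0
# each thread handles one element of the vectors
function saxpy_kernel!(z, a, x, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(z)
        @inbounds z[i] = a * x[i] + y[i]
    end
    return nothing
end
# allocate vectors on GPU
x = CUDA.ones(Float32, dim)
y = CUDA.ones(Float32, dim)
z = CUDA.zeros(Float32, dim)
# launch enough blocks to cover all elements
nthreads = 1024
nblocks = cld(dim, nthreads)
CUDA.@sync @cuda threads=nthreads blocks=nblocks saxpy_kernel!(z, a, x, y)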
Performance Comparison
Since SAXPY is memory bound, we take the achieved GPU memory bandwidth as a performance proxy. As we can see below, all three implementations perform reasonably well, with the CUBLAS and broadcasting variants being practically indistinguishable and close to the theoretical maximum.
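As a rough sketch (not the original measurement code), the achieved bandwidth of, say, the broadcast variant can be estimated from the runtime and the memory traffic, since SAXPY touches three Float32 values per element (read x, read y, write the result):
# time the broadcast; CUDA.@sync makes sure the GPU has finished
t = @elapsed CUDA.@sync z .= a .* x .+ y
# 3 arrays of dim Float32 values are moved through memory
bandwidth_GBs = 3 * dim * sizeof(Float32) / t / 1e9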
[Plots: achieved memory bandwidth of the three variants on an NVIDIA V100 and an NVIDIA A100]
Slides for HPC.NRW