Using HACC Nodes
In addition to the main FPGA installation in Noctua 2, the cluster contains three nodes provided by AMD as part of the HACC initiative. Each node consists of:
- 2x Xilinx Alveo U55C cards
- 2x Xilinx VCK5000 Versal development cards
- 4x AMD Instinct MI210 GPU cards
In contrast to the main fpga partition, the FPGA boards are configured with fixed shells (not selectable by the user), and XRT version 2.16 is installed.
Allocate HACC Node
The three HACC nodes are contained in a separate hacc partition. To be able to submit jobs to this partition, your compute project needs to be enabled to use the HACC resources. Please send a brief email to pc2-support@uni-paderborn.de stating your compute project and which resources you would like to use.
The HACC nodes are handed out exclusively, i.e., sharing a node between multiple jobs is not possible. Because the shell is not user-selectable, nodes can be allocated without any constraint:
[tester@n2login1 ~]$ srun -p hacc -A $YOUR_PROJECT_ID -t 00:10:00 --pty bash
[...]
# Show available FPGAs
[tester@n2hacc03 ~]$ /opt/xilinx/xrt/bin/xbutil examine
System Configuration
OS Name : Linux
Release : 4.18.0-477.51.1.el8_8.x86_64
Version : #1 SMP Fri Mar 1 11:21:44 EST 2024
Machine : x86_64
CPU Cores : 128
Memory : 515287 MB
Distribution : Red Hat Enterprise Linux 8.8 (Ootpa)
GLIBC : 2.28
Model : AS -4124GS-TNR
XRT
Version : 2.16.204
Branch : 2023.2
Hash : fa4c0045003fed0acea4593788dce5ef6d0b66ee
Hash Date : 2023-10-12 06:45:18
XOCL : 2.16.204, fa4c0045003fed0acea4593788dce5ef6d0b66ee
XCLMGMT : 2.16.204, fa4c0045003fed0acea4593788dce5ef6d0b66ee
Devices present
BDF : Shell Logic UUID Device ID Device Ready*
---------------------------------------------------------------------------------------------------------------------------
[0000:81:00.1] : xilinx_u55c_gen3x16_xdma_base_3 97088961-FEAE-DA91-52A2-1D9DFD63CCEF user(inst=134) Yes
[0000:a1:00.1] : xilinx_vck5000_gen4x8_qdma_base_2 05DCA096-76CB-730B-8D19-EC1192FBAE3F user(inst=135) Yes
[0000:c1:00.1] : xilinx_u55c_gen3x16_xdma_base_3 97088961-FEAE-DA91-52A2-1D9DFD63CCEF user(inst=133) Yes
[0000:e1:00.1] : xilinx_vck5000_gen4x8_qdma_base_2 05DCA096-76CB-730B-8D19-EC1192FBAE3F user(inst=132) Yes
* Devices that are not ready will have reduced functionality when using XRT tools
# Show available GPUs.
[tester@n2hacc03 ~]$ /usr/bin/rocm-smi
========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
====================================================================================================================
0 10 0x740f, 12261 44.0°C 38.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
1 11 0x740f, 42047 41.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
2 9 0x740f, 57300 41.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
3 8 0x740f, 1997 38.0°C 42.0W N/A, N/A, 0 800Mhz 1600Mhz 0% auto 300.0W 0% 0%
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================
In the example output you can see:
- xbutil examine: the four FPGA boards, along with their user BDFs. Make sure to select the correct device type in your host code.
- rocm-smi: the four MI210 GPUs.
Software Modules
FPGAs
Depending on which kind of board you are targeting, you need to load the matching FPGA shell module. Note that by default the module for Noctua 2's Alveo U280 boards is loaded. On the HACC nodes, you need to swap it for the one matching your target board:
module load fpga
module load xilinx/xrt/2.16
# Use one of the following commands to swap the shell module against the one you need
module swap xilinx/u280 xilinx/u55c
# OR
module swap xilinx/u280 xilinx/vck5000
Additional Notes on the Use of AI Engines
The VCK5000 boards contain AI Engines that can be programmed using Vitis. To use the AIE compiler and the AIE simulator, additional modules and licenses are required:
Vitis 23.2, which is automatically loaded with XRT 2.16, is currently not able to perform hardware synthesis for the VCK5000 board. If you target hardware synthesis, you may load the module xilinx/xrt/2.15 instead, or swap Vitis for an older version:
ml swap xilinx/vitis/23.2 xilinx/vitis/23.1
In addition to the modules listed above, you may need to load the Graphviz module:
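For example (the exact Graphviz module name and version on the system are assumptions; check first with module avail):
module avail Graphviz   # list the Graphviz module(s) provided on the cluster
module load Graphviz    # module name/version assumed; adjust to what "module avail" shows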
The AIE compiler and simulator require separate software licenses that are not included in Vitis. If you need help in acquiring these licenses from AMD, please get in touch with us.
MI210 GPUs
ROCm (Radeon Open Compute)
Load modules
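A minimal sketch; the ROCm module name and version on Noctua 2 are assumptions, so verify with module avail first:
# ROCm provides hipcc and rocm-smi; module name/version assumed
module load rocm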
Get ROCm examples
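A sketch, assuming the examples are taken from AMD's public rocm-examples repository on GitHub:
# Clone the ROCm examples repository and enter it
git clone https://github.com/ROCm/rocm-examples.git
cd rocm-examples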
Build the rocm-examples/HIP-Basic/bandwidth example
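A sketch assuming the per-example Makefile (the repository also provides a CMake build at the top level):
# Enter the example directory and build it with the HIP compiler from the ROCm module
cd HIP-Basic/bandwidth
make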
Run example on all four GPU cards
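A sketch; the binary name hip_bandwidth is an assumption based on the build above:
# Run the benchmark once per GPU by exposing a single device at a time
for gpu in 0 1 2 3; do
    HIP_VISIBLE_DEVICES=$gpu ./hip_bandwidth
done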
AdaptiveCpp
Load modules
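A minimal sketch, again with assumed module names; check module avail for the exact names (a ROCm module may also be needed for the MI210 backend):
# AdaptiveCpp compiler and runtime; module name assumed
module load adaptivecpp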
Get examples (here oneAPI-examples)
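A sketch; the repository URL and the path to the vector-add sample inside it are assumptions:
# Clone the oneAPI samples and switch to the vector-add example
git clone https://github.com/oneapi-src/oneAPI-samples.git
cd oneAPI-samples/DirectProgramming/C++SYCL/DenseLinearAlgebra/vector-add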
Build the vector-add example with the acpp compiler
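A sketch assuming the buffers variant of the sample; without --acpp-targets, the default generic JIT flow described in the note below is used:
# Compile the SYCL source with AdaptiveCpp's compiler driver (source file name assumed)
acpp -O2 -o vector-add src/vector-add-buffers.cpp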
Please note: if we do not explicitly specify --acpp-targets, we are using the generic LLVM JIT compiler. This is AdaptiveCpp's default, most portable, and usually most performant compilation flow; the generated binary is usable across various backend devices.
Run the same binary on various backend devices:
On an AMD MI210 (by allocating a node in the hacc partition), see the sketch below. Note that the Running on device output in line 16 is AMD Instinct MI210.
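A sketch of how the run could look (project ID and time limit are placeholders; the example output is not reproduced here):
srun -p hacc -A $YOUR_PROJECT_ID -t 00:10:00 ./vector-add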
On an NVIDIA A100 GPU (by allocating a single A100 GPU in the gpu partition), see the sketch below. Note that the Running on device output in line 4 is NVIDIA A100-SXM4-40GB.
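A sketch; the exact GPU request syntax for the gpu partition is an assumption, so check the cluster documentation:
srun -p gpu -A $YOUR_PROJECT_ID --gres=gpu:a100:1 -t 00:10:00 ./vector-add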
On the host CPU (an AMD EPYC 7763 processor in the normal partition), see the sketch below. Note that the Running on device output in line 6 is AdaptiveCpp OpenMP host device, indicating the host CPU system.
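A sketch of a CPU-only run; AdaptiveCpp falls back to the OpenMP host device when no accelerator is present:
srun -p normal -A $YOUR_PROJECT_ID -t 00:10:00 ./vector-add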