Intel oneAPI Quick Start Guide

This guide will walk you through the six steps required to use the Intel oneAPI FPGA toolkit on Noctua 2.

1. Get the examples.

The latest Intel oneAPI-samples are available on GitHub. The master branch of the repository is always under development for the next release and might be incompatible with the latest version installed on our systems. You can check out a version-specific branch that matches the oneAPI version you are going to use.

You can copy the files from our file system to speed up the process with

cp -r /opt/software/FPGA/IntelFPGA/oneapi/24.1.0/oneAPI-samples .
cd oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile

or clone the repository and check out the correct version with

git clone https://github.com/oneapi-src/oneAPI-samples.git
cd oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile

git checkout release/2024.1
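
The branch name can be derived from the installed oneAPI version. The following is a minimal sketch, assuming the mapping seen in this guide (module version 24.1.0 corresponds to branch release/2024.1) also holds for other versions. The helper name oneapi_release_branch is hypothetical, and you should confirm that the resulting branch actually exists (e.g. with git branch -r) before checking it out.

```shell
# Hypothetical helper: map a module version such as 24.1.0
# to a samples branch name such as release/2024.1.
# Assumption: branches follow the pattern release/20<major>.<minor>.
oneapi_release_branch() {
    version="$1"
    major="${version%%.*}"   # text before the first dot, e.g. 24
    rest="${version#*.}"     # text after the first dot, e.g. 1.0
    minor="${rest%%.*}"      # first component of the rest, e.g. 1
    printf 'release/20%s.%s\n' "$major" "$minor"
}

# Possible usage inside the cloned repository:
#   git checkout "$(oneapi_release_branch 24.1.0)"
```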

Intel grouped the FPGA examples (directory oneAPI-samples/DirectProgramming/C++SYCL_FPGA) into

ReferenceDesigns, which demonstrate the implementation of highly optimized algorithms and applications on an FPGA
- anr: Adaptive Noise Reduction
- crr: CRR Binomial Tree Model for Option Pricing
- db: Database Query Acceleration
- gzip: GZIP Compression
- merge_sort: Merge Sort
- mvdr_beamforming: MVDR Beamforming
- qrd: QR Decomposition of Matrices
- qri: QR-based inversion of Matrices
Tutorials, which in turn consist of
- DesignPatterns (double buffering, I/O streaming, loop optimizations, …)
- Features (loop unrolling, pipes, usage of pragmas, …)
- GettingStarted (our guide is based on this tutorial)
- Tools (collect data for profiling)

In this quick start we pick the GettingStarted example from the tutorials, but feel free to explore the other options.

2. Set up the local software environment on Noctua 2.

module reset
module load fpga devel compiler
module load intel/oneapi/24.1.0
module load bittware/520n/20.4.0_hpc
module load CMake
module load GCC

With module reset, previously loaded modules are cleaned up. The three modules loaded first (fpga, devel, compiler) are the gateway modules to the actual modules loaded in lines 3-6. If no version number is provided, the latest version is loaded. To use a specific version, append it to the module name, e.g. intel/oneapi/24.1.0. With the given commands, the following modules are loaded

intel/oneapi: software stack for oneAPI development
bittware/520n: drivers and board support package (BSP) for the Intel Stratix 10 card
CMake: helps with compilation and build
GCC: the GNU Compiler Collection

Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the Stratix 10 as the target card. Observe for example:

echo $FPGA_BOARD_NAME
p520_hpc_sg280l

echo $AOCL_BOARD_PACKAGE_ROOT
/opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default
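
Before continuing, it can help to confirm that the expected modules are actually loaded. The following is a minimal sketch, assuming the module system exports the colon-separated LOADEDMODULES variable (Environment Modules and Lmod normally do). The helper name module_loaded is hypothetical.

```shell
# Hypothetical helper: succeed if a module with the given name
# (with or without a version suffix) appears in LOADEDMODULES.
module_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *":$1:"* | *":$1/"*) return 0 ;;   # exact name or name/version
        *) return 1 ;;
    esac
}

# Possible usage after the module load commands:
#   for m in intel/oneapi bittware/520n CMake GCC; do
#       module_loaded "$m" || echo "module $m is not loaded" >&2
#   done
```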

Supported oneAPI versions and bittware BSP versions:

mkdir build
cd build
cmake .. -DFPGA_DEVICE=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME

The build directory (for cmake) is created and configured with the correct target board (using the environment variables $AOCL_BOARD_PACKAGE_ROOT and $FPGA_BOARD_NAME populated via modules in the previous step).

Expected output

-- The CXX compiler identification is IntelLLVM 2024.1.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/software/FPGA/IntelFPGA/oneapi/24.1.0/compiler/2024.1/bin/icpx - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring the design with the following target: /opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default:p520_hpc_sg280l
-- Configuring done (1.7s)
-- Generating done (0.2s)
-- Build files have been written to: .../oneapi-24.1.0/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile/build
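
The cmake call above relies on both environment variables being set. The following is a minimal sketch that composes the target string and stops early if one of them is missing. The helper name fpga_device_target is hypothetical.

```shell
# Hypothetical helper: print "<BSP root>:<board name>" as expected by
# -DFPGA_DEVICE, or fail if one of the two variables is unset or empty.
fpga_device_target() {
    for name in AOCL_BOARD_PACKAGE_ROOT FPGA_BOARD_NAME; do
        eval "value=\${$name:-}"          # read the variable called $name
        if [ -z "$value" ]; then
            echo "error: $name is not set; load the modules first" >&2
            return 1
        fi
    done
    printf '%s:%s\n' "$AOCL_BOARD_PACKAGE_ROOT" "$FPGA_BOARD_NAME"
}

# Possible usage inside the build directory:
#   target="$(fpga_device_target)" && cmake .. -DFPGA_DEVICE="$target"
```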

3. Build and test the example in emulation.

make fpga_emu

Builds the emulation binary called fast_recompile.fpga_emu.

Under the hood, make performs two main steps here (simplified from the CMake-generated calls)

Creating the object files from source files

mkdir fast_recompile.fpga_emu.dir
icpx -fsycl -Wall -fintelfpga -DFPGA_EMULATOR -o fast_recompile.fpga_emu.dir/kernel.cpp.o -c ../src/kernel.cpp
icpx -fsycl -Wall -fintelfpga -DFPGA_EMULATOR -I ../../../../include/ -o fast_recompile.fpga_emu.dir/host.cpp.o -c ../src/host.cpp

- the flag -fintelfpga instructs the compiler to compile for FPGA.
- the flag -DFPGA_EMULATOR defines a preprocessor macro that the host code uses to select either the emulator or a real FPGA device for execution.

Linking the FPGA emulation binary

icpx -fsycl -fintelfpga fast_recompile.fpga_emu.dir/host.cpp.o fast_recompile.fpga_emu.dir/kernel.cpp.o -o fast_recompile.fpga_emu

- produces the emulation binary called fast_recompile.fpga_emu.

./fast_recompile.fpga_emu

Running on device: Intel(R) FPGA Emulation Device
PASSED: results are correct

This executes the emulation binary. Note that the FPGA Emulation Device is selected.

Note: execution in emulation gives no indication at all about the performance that is to be expected from hardware execution on a real FPGA.
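
If you want to check the emulation run in a script, the two output lines shown above can be tested automatically. The following is a minimal sketch, assuming the output format shown in this guide. The helper name check_emulation_output is hypothetical.

```shell
# Hypothetical helper: read program output from stdin and succeed only if
# it reports the FPGA emulation device and a PASSED result.
check_emulation_output() {
    output="$(cat)"
    if ! printf '%s\n' "$output" | grep 'FPGA Emulation Device' >/dev/null; then
        echo "warning: the emulation device was not reported" >&2
        return 1
    fi
    if ! printf '%s\n' "$output" | grep '^PASSED' >/dev/null; then
        echo "warning: no PASSED line found" >&2
        return 1
    fi
}

# Possible usage inside the build directory:
#   ./fast_recompile.fpga_emu | check_emulation_output && echo "emulation OK"
```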

4. Create and inspect reports as an indicator of expected HW performance.

make report

Under the hood, make executes compilation and linking steps, which could be manually replicated as

mkdir fast_recompile_report.a.dir
icpx  -fsycl -fintelfpga -o fast_recompile_report.a.dir/kernel.cpp.o -c ../src/kernel.cpp
icpx  -fsycl -fintelfpga -I ../../../../include/ -o fast_recompile_report.a.dir/host.cpp.o -c ../src/host.cpp
icpx  -fsycl -fintelfpga -Xshardware -fsycl-link=early -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME fast_recompile_report.a.dir/host.cpp.o fast_recompile_report.a.dir/kernel.cpp.o -o fast_recompile_report.a

In addition to the compilation flag -fintelfpga that we already used when compiling for emulation, we make use of

the -Xshardware flag to target FPGA hardware instead of the emulator
the -fsycl-link=early flag to instruct the compiler to generate the optimization reports and to stop afterwards
the -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME flag to tell the compiler the target FPGA card.

The generated report is an HTML file (called report.html) located in fast_recompile_report.prj/reports/ for our example.

The general location of the report is always <file_name>_report.prj/reports/report.html.

With older tool versions, reports from this repository generated using cmake tend to have a problem with escape characters in one of the files. You can fix this problem with one of these commands, depending on the name of the report directory.

sed -i 's/\\\\\\3/\\\\\\\\3/g; s/\\\\\\\\\"/\\\\\\\\\\\"/g' fast_recompile_report.prj/reports/resources/report_data.js

sed -i 's/\\\\\\3/\\\\\\\\3/g; s/\\\\\\\\\"/\\\\\\\\\\\"/g' fast_recompile.report.prj/reports/resources/report_data.js

Make sure the file was found correctly (no error message) or verify which path actually exists:

ls fast_recompile*report.prj/reports/resources/report_data.js
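
To avoid guessing which of the two directory names exists, you can let the shell search for the file. The following is a minimal sketch that prints every matching report_data.js below the current directory's report folders. The helper name find_report_data is hypothetical.

```shell
# Hypothetical helper: print all existing report_data.js files whose report
# directory starts with the given design name; fail if none is found.
find_report_data() {
    status=1
    for f in "$1"*report.prj/reports/resources/report_data.js; do
        if [ -f "$f" ]; then     # an unmatched glob stays literal and is skipped
            printf '%s\n' "$f"
            status=0
        fi
    done
    return "$status"
}

# Possible usage inside the build directory:
#   find_report_data fast_recompile || echo "no report found" >&2
```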

In order to inspect the report, you may want to copy it to your local file system or mount your working directory. For more information, refer to Data Management. For example, you can compress the report on Noctua 2:

tar -caf fast_recompile_report.tar.gz fast_recompile_report.prj/reports

Then copy and decompress it from your local command line (e.g. Linux, macOS, or Windows Subsystem for Linux):

rsync -azv -e 'ssh -J <your-username>@fe.noctua2.pc2.uni-paderborn.de' <your-username>@n2login2:/scratch/<DIRECTORY_ASSIGNED_TO_YOUR_PROJECT>/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile/build/fast_recompile_report.tar.gz .

tar -xzf fast_recompile_report.tar.gz
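
Before or after the transfer, you can verify that the archive really contains the report. The following is a minimal sketch that lists the archive contents and looks for report.html. The helper name archive_contains_report is hypothetical.

```shell
# Hypothetical helper: succeed if the gzip-compressed tar archive given as
# first argument contains a file named reports/report.html.
archive_contains_report() {
    tar -tzf "$1" | grep 'reports/report\.html$' >/dev/null
}

# Possible usage:
#   archive_contains_report fast_recompile_report.tar.gz \
#       || echo "report.html is missing from the archive" >&2
```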

Open and inspect fast_recompile_report.prj/reports/report.html in your browser. The whole analysis contains little information, since the example is very simple. The main blocks of the report are

Throughput Analysis -> Loop Analysis: displays information about all loops and their optimization status (is it pipelined? what is the initiation interval (II) of the loop?, …).
Area Estimates: details about the area utilization, with architectural insight into the generated hardware.
Views -> System Viewer: gives an overall overview of your kernels, their connections between each other and to external resources like memory.
Views -> Kernel Memory Viewer: displays the data movement and synchronization in your code.
Views -> Schedule Viewer: shows the scheduling of the generated instructions with corresponding latencies.

The Area Estimates for our example show that the kernel system uses at most 1% of the available resources, so much more complex kernels could fit on the FPGA. The System Viewer shows two 32-bit burst-coalesced load operations and one 32-bit burst-coalesced store operation. Refer to Intel's documentation to learn more about the properties of the report and the optimization goals.

5. Build the hardware design (bitstream).

This hardware build step (so-called hardware synthesis) can take a lot of time (hours!) and compute resources, so we create a batch script to submit the job to the Slurm workload manager. Make sure to replace the placeholder with your actual project acronym.

#!/bin/sh

# synthesis_script.sh script

#SBATCH -t 24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH -q fpgasynthesis
#SBATCH -A <your_project_acronym>
#SBATCH -p normal

module reset
module load fpga devel compiler
module load intel/oneapi/24.1.0
module load bittware/520n/20.4.0_hpc
module load CMake
module load GCC

make fpga

Then, we submit synthesis_script.sh to the Slurm workload manager:

sbatch synthesis_script.sh

With --mem=32G, we allocate a small amount of main memory to this synthesis job, corresponding to the very small example we build here. For larger designs, typically at least 64G will be needed.
With --cpus-per-task=8, we use more cores to parallelize the synthesis.
Using -q fpgasynthesis gives the FPGA bitstream-synthesis jobs a higher priority, see Quality-of-Service (QoS) and Job Priorities.
You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.
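
When scripting the submission, the job ID can be taken from the line that sbatch prints ("Submitted batch job <jobid>") and used to find the job's output file. The following is a minimal sketch, assuming the default output file name slurm-<jobid>.out mentioned above. The helper names job_id_from_sbatch and slurm_log_name are hypothetical.

```shell
# Hypothetical helper: read sbatch output from stdin and print the job ID.
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Hypothetical helper: name of the default Slurm output file for a job ID.
slurm_log_name() {
    printf 'slurm-%s.out\n' "$1"
}

# Possible usage on a login node:
#   job_id="$(sbatch synthesis_script.sh | job_id_from_sbatch)"
#   squeue -j "$job_id"
#   cat "$(slurm_log_name "$job_id")"
```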

Under the hood, make performs a step that we already know from the emulation and report generation

icpx  -fsycl -fintelfpga -Xshardware -fsycl-link=image -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME fast_recompile.fpga.dir/host.cpp.o fast_recompile.fpga.dir/kernel.cpp.o -o fast_recompile.fpga

The only difference is the -fsycl-link=image flag. It instructs the compiler to perform the full hardware synthesis instead of stopping after the optimization report.
The binary fast_recompile.fpga is generated. It will handle the initialization and execution of the code on an FPGA (see next step).

Expected output

[ 33%] Generating dev_image.a
aoc: Compiling for FPGA. This process may take several hours to complete.  Prior to performing this compile, be sure to check the reports to ensure the design will meet your performance targets.  If the reports indicate performance targets are not being met, code edits may be required.  Please refer to the oneAPI FPGA Optimization Guide for information on performance tuning applications for FPGAs.
[ 66%] Generating host.o
[100%] Generating fast_recompile.fpga
[100%] Built target fpga

Note that the build of the hardware design will create another report similar to the report that we discussed in the previous step. In contrast to the previous report, the new report contains the actual resource utilization of the design. More details on the analysis of the actual image can be found in Intel’s documentation.

To speed up the process and avoid spending resources on unnecessary synthesis, we have pre-synthesized the design. The following steps copy the pre-synthesized design for hardware execution.

In order to still use the Slurm workload manager, we use a modified batch script copy_pre-synthesed_design_script.sh and submit it.

#!/bin/sh

# copy_pre-synthesed_design_script.sh

# Instead of starting the actual synthesis with `make fpga`,
# we extract the result from an archive.
tar -xvf /opt/software/FPGA/IntelFPGA/oneapi/24.1.0/pre-synthesized/fast_recompile_-_20.4.0_hpc_-_24.1.0.tar.gz

Then, we submit copy_pre-synthesed_design_script.sh to the Slurm workload manager:

sbatch --partition=normal -A <your_project_acronym> -t 00:10:00 ./copy_pre-synthesed_design_script.sh

Note that in this example, we pass the relevant job parameters directly to sbatch instead of encoding them in the job script as done in the synthesis_script.sh example. For many purposes, both variants are equally suitable.
We submit into --partition=normal.
With -t 00:10:00, we allocate a small amount of time to this file copy job.
You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.

6. Execute the hardware design on an FPGA.

After the hardware synthesis, we can allocate a suitably configured and equipped FPGA node for execution.

srun --partition=fpga -A <your_project_acronym> --constraint=bittware_520n_20.4.0_hpc -t 2:00:00 --pty bash

To run the design, we load the proper modules and execute the generated binary on the allocated FPGA node

module reset
module load fpga devel compiler
module load intel/oneapi/24.1.0
module load bittware/520n/20.4.0_hpc
module load CMake
module load GCC

./fast_recompile.fpga

Running on device: p520_hpc_sg280l : BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie0)
MMD INFO : Disabling SmartVID (fix) polling
MMD INFO : Enabling SmartVID (fix) polling
PASSED: results are correct

How to proceed

For more information on using the tools, refer to

DPC++ FPGA Code Samples Guide,
Intel’s oneAPI Programming Guide and especially the FPGA flow,
the FPGA optimization guide,
and finally the open-access book on Data Parallel C++.
