Intel oneAPI FPGA Quick Start Guide

This guide walks you through the six steps required to use the Intel oneAPI FPGA toolkit on Noctua 2.

1. Get the latest examples.

git clone https://github.com/oneapi-src/oneAPI-samples.git
cd oneAPI-samples/DirectProgramming/DPC++FPGA/Tutorials/GettingStarted/fpga_compile

Details

Intel grouped the FPGA examples (directory oneAPI-samples/DirectProgramming/DPC++FPGA) into

ReferenceDesigns, which demonstrate the implementation of highly optimized algorithms and applications on an FPGA
- anr: Adaptive Noise Reduction
- crr: CRR Binomial Tree Model for Option Pricing
- db: Database Query Acceleration
- gzip: GZIP Compression
- merge_sort: Merge Sort
- mvdr_beamforming: MVDR Beamforming
- qrd: QR Decomposition of Matrices
- qri: QR-based inversion of Matrices
Tutorials, which in turn consist of
- DesignPatterns (double buffering, I/O streaming, loop optimizations, …)
- Features (loop unrolling, pipes, usage of pragmas, …)
- GettingStarted (our guide is based on this tutorial)
- Tools (collect data for profiling)

In this quick start we pick the GettingStarted example from the tutorials, but feel free to explore the other options.

2. Set up the local software environment on Noctua 2.

module load intel/oneAPI
module load bittware_520n
module load devel/CMake

Details

If no version number is provided, the latest versions are loaded. To use a specific version, append it to the module name, e.g. intel/oneAPI/22.1. With the given commands, the following modules are loaded:

intel/oneAPI: software stack for oneAPI development
bittware_520n: drivers and board support package (BSP) for the Intel Stratix 10 card
CMake: helps with compilation and build

Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the Stratix 10 as the target card. Observe for example:

echo $FPGA_BOARD_NAME
p520_hpc_sg280l

echo $AOCL_BOARD_PACKAGE_ROOT
/cm/shared/opt/intelFPGA_pro/20.4.0/hld/board/bittware_pcie/s10_hpc_default
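Before configuring the build, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch of such a check. The function name check_fpga_env is hypothetical; it is not provided by the modules or by the samples.

```shell
#!/bin/sh
# Hypothetical helper (not provided by the modules): report whether the two
# board variables used later in this guide are set in the current shell.
check_fpga_env() {
    status=0
    for name in FPGA_BOARD_NAME AOCL_BOARD_PACKAGE_ROOT; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "ok: $name=$value"
        else
            echo "missing: $name (did you load the modules?)"
            status=1
        fi
    done
    return "$status"
}
```

After loading the modules, calling check_fpga_env should print one "ok" line per variable; a "missing" line suggests that a module load did not succeed.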

If you have a project that was only validated with an older BSP, you can explicitly load the module for an older version of the BSP, e.g. bittware_520n/19.4.0_hpc.

Supported oneAPI versions and Bittware BSP versions:

TBD

mkdir build
cd build
cmake .. -DFPGA_BOARD=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME

Details

The build directory (for cmake) is created and configured with the correct target board (using the environment variables $AOCL_BOARD_PACKAGE_ROOT and $FPGA_BOARD_NAME populated via modules in the previous step).
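The value passed to -DFPGA_BOARD is simply the BSP root and the board name joined by a colon. As a small sketch, the hypothetical helper board_spec below (a name introduced here for illustration only) builds that string and refuses to produce a half-empty value when one of the variables is unset.

```shell
#!/bin/sh
# Hypothetical helper: join BSP root and board name into "<root>:<name>",
# the form passed to -DFPGA_BOARD in this guide.
board_spec() {
    root=${1:-}
    name=${2:-}
    if [ -z "$root" ] || [ -z "$name" ]; then
        echo "board_spec: BSP root and board name must both be non-empty" >&2
        return 1
    fi
    printf '%s:%s\n' "$root" "$name"
}
```

It could be used as cmake .. -DFPGA_BOARD="$(board_spec "$AOCL_BOARD_PACKAGE_ROOT" "$FPGA_BOARD_NAME")", which is equivalent to the command above when both variables are set.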

Expected output

-- The CXX compiler identification is IntelLLVM 2022.0.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /cm/shared/opt/intel_oneapi/2022.1/compiler/2022.0.1/linux/bin/dpcpp - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring the design to run on FPGA board /cm/shared/opt/intelFPGA_pro/20.4.0/hld/board/bittware_pcie/s10_hpc_default:p520_hpc_sg280l
-- Configuring done
-- Generating done
-- Build files have been written to: .../oneAPI-samples/DirectProgramming/DPC++FPGA/Tutorials/GettingStarted/fpga_compile/build

3. Build and test the example in emulation.

make fpga_emu

Builds the emulation binary called fpga_compile.fpga_emu.

Details

Under the hood, make performs two main steps here:

Creating the object files from source files
- dpcpp -Wall -fintelfpga -DFPGA_EMULATOR ... -o .../fpga_compile.cpp.o .../fpga_compile.cpp
  - the flag -fintelfpga instructs the compiler to compile for FPGA.
  - the flag -DFPGA_EMULATOR defines the pre-processor macro FPGA_EMULATOR, which the host code checks to select either the emulator or a real FPGA device for execution.
Linking the FPGA emulation binary
- dpcpp -fintelfpga CMakeFiles/fpga_compile.fpga_emu.dir/fpga_compile.cpp.o -o ../fpga_compile.fpga_emu
  - produces the emulation binary called fpga_compile.fpga_emu.

./fpga_compile.fpga_emu

Expected output

Running on device: Intel(R) FPGA Emulation Device
PASSED: results are correct

This executes the emulation binary. Note that the FPGA Emulation Device is selected.

Note: execution in emulation gives no indication at all of the performance that can be expected from hardware execution on a real FPGA.
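When the run is scripted, the two things to look for in the output are the selected device and the PASSED line. The sketch below assumes the output format shown under "Expected output"; summarize_run is a hypothetical helper name, not part of the sample.

```shell
#!/bin/sh
# Hypothetical helper: read the program output on stdin and report the device
# kind and whether the self-check passed. The matched strings are taken from
# the expected output shown in this guide.
summarize_run() {
    output=$(cat)
    case $output in
        *"Emulation Device"*)   device=emulator ;;
        *"Running on device:"*) device=other ;;
        *)                      device=unknown ;;
    esac
    case $output in
        *"PASSED"*) result=passed ;;
        *)          result=failed ;;
    esac
    echo "device=$device result=$result"
    # Return success only if the PASSED line was found.
    [ "$result" = passed ]
}
```

A possible use is ./fpga_compile.fpga_emu | summarize_run, which returns a non-zero exit status if no PASSED line was found.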

4. Create and inspect reports as an indicator of expected HW performance.

dpcpp -fintelfpga -Xshardware -fsycl-link=early -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME ../src/fpga_compile.cpp -o fpga_compile_report.a

Details

In addition to the compilation flag -fintelfpga that we already used for the emulation build, we make use of

the -Xshardware flag to instruct the compiler to target FPGA hardware instead of the emulator,
the -fsycl-link=early flag to instruct the compiler to generate the optimization reports and to stop afterwards,
the -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME flag to tell the compiler the target FPGA card.

The generated report is an HTML file (called report.html) located in fpga_compile_report.prj/reports/ for our example.

The general location of the report is always <file_name>.prj/reports/report.html.

To inspect the report, you may want to copy it to your local file system or mount your working directory. For more information, refer to [Noctua2-FileSystems]. For example, you can compress the report on Noctua 2:
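The naming rule can be sketched as a small function that maps the name given to -o onto the report location. This assumes, as in the example above, that the argument is a plain file name whose last extension is dropped to form the .prj directory name; report_path is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: derive the report location from the output file name,
# following the rule <file_name>.prj/reports/report.html stated in this guide.
# Assumption: the argument is a plain file name without directory components.
report_path() {
    name=${1:-}
    case $name in
        ''|*/*)
            echo "usage: report_path <output-file-name>" >&2
            return 1 ;;
    esac
    case $name in
        ?*.*) base=${name%.*} ;;  # drop the last extension, e.g. ".a"
        *)    base=$name ;;       # no extension to drop
    esac
    printf '%s.prj/reports/report.html\n' "$base"
}
```

For the command in this step, report_path fpga_compile_report.a prints fpga_compile_report.prj/reports/report.html.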

tar -caf fpga_compile_report.tar.gz fpga_compile_report.prj/reports

Then copy and decompress it from your local command line (e.g. Linux, MacOS, or Windows Subsystem for Linux):

TBD
rsync -azv -e 'ssh -J <your-username>@fe.noctua.pc2.uni-paderborn.de' <your-username>@ln-0001:/scratch/<DIRECTORY_ASSIGNED_TO_YOUR_PROJECT>/<PATH_TO_YOUR_BUILD_DIRECTORY>/fpga_compile_report.tar.gz .

tar -xzf fpga_compile_report.tar.gz

Open and inspect fpga_compile_report.prj/reports/report.html in your browser. Because the example is very simple, the report contains little information overall. The main blocks of the report are:

Throughput Analysis -> Loop Analysis: displays information about all loops and their optimization status (is it pipelined? what is the initiation interval (II) of the loop?, …).
Area Analysis (of System): details about the area utilization with architectural details into the generated hardware.
Views -> System Viewer: gives an overall overview of your kernels, their connections between each other and to external resources like memory.
Views -> Kernel Memory Viewer: displays the data movement and synchronization in your code.
Views -> Schedule Viewer: shows the scheduling of the generated instructions with corresponding latencies.
Bottleneck Viewer: identifies bottlenecks that reduce the performance of the design (by lowering the maximum clock frequency of the design (F_max), increasing the initiation interval (II), …).

The Area Analysis for our example shows that the kernel system uses at most 1% of the available resources, so much more complex kernels could fit on the FPGA. The System Viewer shows two 32-bit burst-coalesced load operations and one 32-bit burst-coalesced store operation. Refer to Intel's documentation to learn more about the properties of the report and the optimization goals.

5. Build the hardware design (bitstream)

This hardware build step (so-called hardware synthesis) can take a long time (hours!) and considerable compute resources, so we create a batch script to submit the job to the Slurm workload manager.

#!/bin/sh

# synthesis_script.sh script

module load intel/oneAPI
module load bittware_520n
module load devel/CMake

make fpga

Then, we submit synthesis_script.sh to the Slurm workload manager:

sbatch --partition=fpgasyn -A <your_project_acronym> --mem=32G -t 24:00:00 ./synthesis_script.sh
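As an alternative to passing the options on the command line, Slurm also reads them from #SBATCH lines at the top of the batch script. The sketch below writes such a variant of the script; the generator function write_synthesis_script is hypothetical, and the option values are simply the ones used in the sbatch command above (--account and --time are the long forms of -A and -t).

```shell
#!/bin/sh
# Hypothetical generator: write a variant of synthesis_script.sh that carries
# the sbatch options from this guide as #SBATCH directives.
write_synthesis_script() {
    account=${1:-}
    target=${2:-synthesis_script.sh}
    if [ -z "$account" ]; then
        echo "usage: write_synthesis_script <project_acronym> [file]" >&2
        return 1
    fi
    # Unquoted EOF: $account is expanded while the file is written.
    cat > "$target" <<EOF
#!/bin/sh
#SBATCH --partition=fpgasyn
#SBATCH --account=$account
#SBATCH --mem=32G
#SBATCH --time=24:00:00

module load intel/oneAPI
module load bittware_520n
module load devel/CMake

make fpga
EOF
    chmod +x "$target"
}
```

With a script generated this way, the submission reduces to sbatch ./synthesis_script.sh, run from the build directory so that make fpga finds the Makefile.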

Details and expected output with annotations

With --mem=32G, we allocate a small amount of main memory to this synthesis job, corresponding to the very small example we build here. For larger designs, typically at least 64G will be needed.
You can check the progress of your job via squeue and, after the job completes, inspect the complete job output in slurm-<jobid>.out.
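On submission, sbatch prints a line of the form "Submitted batch job <jobid>". As a small sketch, the hypothetical helper slurm_output_file below extracts the job ID from that line and prints the name of the corresponding default output file.

```shell
#!/bin/sh
# Hypothetical helper: map the line printed by sbatch
# ("Submitted batch job <jobid>") to the default output file name.
slurm_output_file() {
    line=${1:-}
    # Keep only the last whitespace-separated word, which is the job ID.
    jobid=${line##* }
    case $jobid in
        ''|*[!0-9]*)
            echo "could not find a job id in: $line" >&2
            return 1 ;;
    esac
    echo "slurm-$jobid.out"
}
```

For example, storing the sbatch output in a variable and passing it to slurm_output_file yields the file to inspect (or to follow with tail -f) once the job has started.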

Under the hood, make performs a step that we already know from the emulation and report generation:

dpcpp -fintelfpga -Xshardware -fsycl-link=image -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME CMakeFiles/fpga_compile.fpga.dir/fpga_compile.cpp.o -o ../fpga_compile.fpga

The only difference is the -fsycl-link=image flag. It instructs the compiler to perform the full hardware synthesis instead of stopping after the optimization report.
The binary fpga_compile.fpga is generated. It handles the initialization and execution of the code on an FPGA (see next step).
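To keep the three build variants apart, the flags quoted in this guide can be collected in one place. The sketch below only prints them and does not invoke the compiler; fpga_link_flags is a hypothetical helper name, and the board specification is read from the two environment variables set by the modules.

```shell
#!/bin/sh
# Hypothetical helper: print the dpcpp flags quoted in this guide for each
# build variant. It does not call the compiler.
fpga_link_flags() {
    board="${AOCL_BOARD_PACKAGE_ROOT:-}:${FPGA_BOARD_NAME:-}"
    case ${1:-} in
        # Emulation: the compile step additionally uses -DFPGA_EMULATOR.
        emu)    echo "-fintelfpga" ;;
        # Report only: stop after generating the optimization reports.
        report) echo "-fintelfpga -Xshardware -fsycl-link=early -Xsboard=$board" ;;
        # Full hardware synthesis (bitstream).
        fpga)   echo "-fintelfpga -Xshardware -fsycl-link=image -Xsboard=$board" ;;
        *)
            echo "usage: fpga_link_flags emu|report|fpga" >&2
            return 1 ;;
    esac
}
```

Comparing the output for report and fpga shows that only the value of -fsycl-link changes between the two.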

Expected output

TBD

Note that the build of the hardware design creates another report similar to the one discussed in the previous step. In contrast to the previous report, the new report contains the actual resource utilization of the design. More details on the analysis of the actual image can be found in Intel’s documentation.

6. Execute the hardware design on an FPGA.

After the hardware synthesis, we can allocate a suitably configured and equipped FPGA node for execution.

srun --partition=fpga -A <your_project_acronym> --constraint=20.4.0_hpc -t 2:00:00 --pty bash

To run the design, we load the proper modules and execute the generated binary on the allocated FPGA node:

module load intel/oneAPI
module load bittware_520n
module load devel/CMake

./fpga_compile.fpga

Expected output

TBD

How to proceed

For more information on using the tools, refer to

DPC++ FPGA Code Samples Guide,
Intel’s oneAPI Programming Guide and especially the FPGA flow,
the FPGA optimization guide,
and finally the open-access book on Data Parallel C++.