This guide walks you through the six steps required to use the Intel OpenCL FPGA toolkit on Noctua 2.

1. Get the latest examples.

We will use the vector_add example code that is shipped with the Intel FPGA SDK for OpenCL.

We recommend working in /scratch/ because FPGA designs consume a considerable amount of disk space. Navigate to the directory assigned to your project under /scratch/ and create a working directory for this example:

Code Block
languagebash
cd /scratch/[DIRECTORY_ASSIGNED_TO_YOUR_PROJECT]
mkdir getting_started_with_fpgas
cd getting_started_with_fpgas

Afterwards, copy the vector_add example and the shared common directory into your getting_started_with_fpgas workspace:

Code Block
cp -r /cm/shared/opt/intelFPGA_pro/21.4.0/hld/examples_aoc/common .
cp -r /cm/shared/opt/intelFPGA_pro/21.4.0/hld/examples_aoc/vector_add .

The copied directories are:

  • common: Includes helper and utility functions to interface with the FPGA

  • vector_add: Includes the actual application code

Other available examples to try are:

Code Block
asian_option
channelizer
compression
compute_score
double_buffering
fd3d
fft1d
fft1d_offchip
fft2d
hello_world
jpeg_decoder
library_example1
library_example2
library_hls_sot
library_matrix_mult
local_memory_cache
loopback_hostpipe
mandelbrot
matrix_mult
multithread_vector_operation
n_way_buffering
optical_flow
sobel_filter
tdfir
vector_add                      <--- used in this guide
video_downscaling
web

2. Set up the local software environment on Noctua 2.

Code Block
module load intelFPGA_pro
module load bittware_520n
module load toolchain/gompi

If no version number is provided, the latest versions are loaded. To use a specific version, append it to the module name, e.g. intelFPGA_pro/20.4.0_hpc. All available versions can be queried with module avail intelFPGA_pro. With the given commands, the following modules are loaded:

  • intelFPGA_pro: Loads the compilation infrastructure for FPGA code

  • bittware_520n: drivers and board support package (BSP) for the Intel Stratix 10 card

  • toolchain/gompi: Loads the compilation infrastructure for the host code (most current C++ compilers will work)

Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the Stratix 10 as the target card. Observe for example:

Code Block
echo $FPGA_BOARD_NAME
p520_hpc_sg280l

echo $AOCL_BOARD_PACKAGE_ROOT
/cm/shared/opt/intelFPGA_pro/20.4.0/hld/board/bittware_pcie/s10_hpc_default
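
Because later steps rely on these variables, it can help to verify them before starting a build. The following is a minimal sketch, not part of the Intel or PC2 tooling; the function name check_fpga_env is hypothetical, and it only checks the two variables shown above.

```shell
# Sketch: report whether the module-provided variables are set.
# check_fpga_env is a hypothetical helper name (an assumption, not an existing tool).
check_fpga_env() {
    rc=0
    for var in FPGA_BOARD_NAME AOCL_BOARD_PACKAGE_ROOT; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var (did you load the modules?)"
            rc=1
        else
            echo "ok: $var=$value"
        fi
    done
    return "$rc"
}
```

Calling check_fpga_env after the module load commands prints one line per variable and returns a non-zero status if either one is unset.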

If you have a project that was only validated with an older BSP, you can explicitly load the module for an older version of the BSP, e.g. bittware_520n/19.4.0_hpc.

Supported OpenCL SDK versions and BittWare BSP versions:

TBD

3. Build and test the example in emulation.

The compilation is divided into two parts:

  • host code: executed on the CPU. Performs initialization, data handling and FPGA device setup. Host code is compiled with a regular GCC compiler.

  • kernel code: executed on the FPGA or often in emulation on the CPU.

In this step, we first compile the host code and then compile the kernel code for emulation on the CPU. We start with the host code:

Code Block
cd vector_add
make all

This builds the host executable called host in the subdirectory bin.

Behind the scenes, the Makefile triggers the following command, which puts together the correct OpenCL headers and libraries to produce the executable bin/host:

g++ -O2 -fstack-protector -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -fPIE -fPIC -fPIC -I../common/inc -I/cm/shared/opt/intelFPGA_pro/21.4.0/hld/host/include host/src/main.cpp ../common/src/AOCLUtils/opencl.cpp ../common/src/AOCLUtils/options.cpp -L/cm/shared/opt/intelFPGA_pro/21.4.0/hld/host/linux64/lib -z noexecstack -Wl,-z,relro,-z,now -Wl,-Bsymbolic -pie -lOpenCL -lrt -lpthread -o bin/host

Further behind the scenes, the Makefile determines some of these compile parameters by invoking the command line tool aocl, which reports them according to the actual environment as set up with modules. You can inspect these parameters by invoking the commands yourself and use them in your own build process:

Code Block
aocl compile-config
aocl ldlibs
aocl ldflags
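
As an illustration of using these values in your own build process, the sketch below assembles a host compile command from the three aocl subcommands listed above. The helper name host_build_cmd and its two arguments are hypothetical, and the function only prints the command so that it can be inspected before it is run.

```shell
# Sketch: assemble a host build command from the aocl helper outputs.
# host_build_cmd and its arguments are hypothetical names (assumptions);
# the aocl subcommands are the ones listed in the guide.
host_build_cmd() {
    src=$1      # host source file, e.g. host/src/main.cpp
    target=$2   # output binary, e.g. bin/host
    echo "g++ -O2 $(aocl compile-config) $src $(aocl ldflags) $(aocl ldlibs) -o $target"
}
```

With the modules loaded, host_build_cmd host/src/main.cpp bin/host prints a command that can be compared with the one issued by the example's Makefile.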

Now that the host code is generated, we can compile the kernel code:

Code Block
aoc -march=emulator -no-interleaving=default device/vector_add.cl -o bin/vector_add.aocx


In contrast to the compilation of CPU or GPU code, the compilation (often called synthesis) of FPGA kernel code can take several hours or even days. Before synthesis and hardware execution, it is highly recommended to check the functionality of your design in emulation. Emulation compiles the FPGA kernel code for an emulator on the CPU (as the name suggests) and completes within seconds to minutes.

  • -march=emulator: Tells the compiler to compile for CPU emulation.

Having the host and kernel compiled, we can execute the program:

Code Block
./bin/host -emulator

Expected output:

Code Block
./bin/host -emulator
Initializing OpenCL
Platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
Using 1 device(s)
  Intel(R) FPGA Emulation Device
Using AOCX: vector_add.aocx
Launching for device 0 (1000000 elements)

Time: 8.075 ms
Kernel time (device 0): 1.071 ms

Verification: PASS

Note that the FPGA Emulation Device is selected.

Note: execution in emulation gives no indication at all about the performance that is to be expected from hardware execution on a real FPGA.
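
When the emulation run is part of a script, the verification line of the program output can be turned into an exit status. This is a minimal sketch; the helper name verification_passed is hypothetical, and it assumes the program prints the line "Verification: PASS" on success.

```shell
# Sketch: succeed only if the program output contains "Verification: PASS".
# verification_passed is a hypothetical helper; it reads the output on stdin.
verification_passed() {
    grep -q 'Verification: PASS'
}

# Usage in the vector_add directory after the emulation build:
#   ./bin/host -emulator | verification_passed && echo "emulation run OK"
```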

4. Create and inspect reports as an indicator of expected HW performance.

To check if the kernel can be translated into an efficient FPGA design, intermediate files and an .html report can be generated with the following command:

Code Block
aoc -rtl -v -board=p520_max_sg280l -board-package=/cm/shared/opt/intelFPGA_pro/20.4.0/hld/board/bittware_pcie/s10 device/vector_add.cl -o vector_add_report

Background:

  • -rtl: Tells the compiler to stop after report generation.

  • -v: Shows more details during the generation

  • -board=p520_max_sg280l: Specifies the target FPGA board (Bittware 520N with Intel Stratix 10 GX 2800).

  • -board-package=/cm/shared/opt/intelFPGA_pro/20.4.0/hld/board/bittware_pcie/s10: Specifies the BSP in the correct version.

  • device/vector_add.cl: Kernel file for vector_add written in OpenCL.

  • -o vector_add_report: Output directory.
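
Instead of hard-coding the board name and BSP path, the command can also be assembled from the environment variables set by the modules. The sketch below only prints such a command; report_cmd is a hypothetical helper name, and the resulting values depend on the modules that are loaded, so they may differ from the literal values used in the command above.

```shell
# Sketch: build the report command from the module environment.
# report_cmd is a hypothetical helper (an assumption); it prints the command
# instead of running it, so the result can be checked first.
report_cmd() {
    kernel=$1   # kernel source, e.g. device/vector_add.cl
    outdir=$2   # output directory, e.g. vector_add_report
    echo "aoc -rtl -v -board=$FPGA_BOARD_NAME -board-package=$AOCL_BOARD_PACKAGE_ROOT $kernel -o $outdir"
}
```

For example, report_cmd device/vector_add.cl vector_add_report prints the full aoc invocation for the currently loaded BSP.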

To inspect the report, you may want to copy it to your local file system or mount your working directory. For more information, refer to [Noctua2-FileSystems]. For example, you can compress the report on Noctua 2:

Code Block
tar -caf vector_add_report.tar.gz vector_add_report/reports

Then copy and decompress it from your local command line (e.g. Linux, MacOS, or Windows Subsystem for Linux):

Code Block
rsync -azv -e 'ssh -J <your-username>@fe.noctua.pc2.uni-paderborn.de' <your-username>@ln-0001:/scratch/<DIRECTORY_ASSIGNED_TO_YOUR_PROJECT>/getting_started_with_fpgas/vector_add/vector_add_report.tar.gz .

tar -xzf vector_add_report.tar.gz

Open and inspect vector_add_report/reports/report.html in your browser. The main blocks of the report are

  • Throughput Analysis -> Loop Analysis: displays information about all loops and their optimization status (is the loop pipelined? what is its initiation interval (II)?, …).

  • Area Analysis (of System): details about the area utilization, with architectural insights into the generated hardware.

  • Views -> System Viewer: gives an overview of your kernels, their connections to each other and to external resources like memory.

  • Views -> Kernel Memory Viewer: displays the data movement and synchronization in your code.

  • Views -> Schedule Viewer: shows the scheduling of the generated instructions with corresponding latencies.

  • Bottleneck Viewer: identifies bottlenecks that reduce the performance of the design (e.g. by lowering the maximum clock frequency (Fmax) or increasing the initiation interval (II), …).

For our example, the throughput analysis contains little information, because the example is very simple and ND-Range kernels such as the one used here yield fewer details in the report than Single Work Item kernels. The area analysis shows that the kernel system uses at most 1% of the available resources, so much more complex or parallel kernels could fit on the FPGA. The system viewer shows two 32-bit burst-coalesced load operations and one 32-bit burst-coalesced store operation. Refer to Intel's documentation (in particular the Programming Guide and the Best Practices Guide for the Intel FPGA SDK for OpenCL) to learn more about the properties of the report and the optimization goals.

5. Build the hardware design (bitstream)

In this step, we build the kernel code for execution on the FPGA. This hardware build step (so-called hardware synthesis) can take a long time (hours!) and considerable compute resources, so we create a batch script to submit the job to the Slurm workload manager.

Code Block
#!/bin/sh

# synthesis_script.sh script

module load intelFPGA_pro
module load bittware_520n
module load toolchain/gompi

aoc -board=p520_max_sg280l -board-package=/cm/shared/opt/intelFPGA_pro/20.4.0/hld/board/bittware_pcie/s10 device/vector_add.cl -o bin/vector_add.aocx

Then, we submit synthesis_script.sh to the Slurm workload manager:

...

  • With --mem=32G, we allocate a small amount of main memory to this synthesis job, corresponding to the very small example we build here. For larger designs, typically at least 64G will be needed.

  • You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.
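
To find the job output file without looking up the job ID by hand, the confirmation line printed at submission can be parsed. This is a minimal sketch; slurm_log_for is a hypothetical helper name, and it assumes that sbatch prints its default message "Submitted batch job <jobid>" and that the default output file name slurm-<jobid>.out is used.

```shell
# Sketch: derive the Slurm log file name from sbatch's confirmation line.
# slurm_log_for is a hypothetical helper (an assumption); it reads sbatch's
# output on stdin and expects the default "Submitted batch job <jobid>" message.
slurm_log_for() {
    awk '/^Submitted batch job/ { print "slurm-" $4 ".out" }'
}

# Usage: pipe the output of your sbatch call into the helper, e.g.
#   sbatch <your options> ./synthesis_script.sh | slurm_log_for
```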

    The aoc command uses the following parameters:

    • -board=p520_max_sg280l: Specifies the target FPGA board (Bittware 520N with Intel Stratix 10 GX 2800).

    • -board-package=/cm/shared/opt/intelFPGA_pro/20.4.0/hld/board/bittware_pcie/s10: Specifies the BSP in the correct version.

    • device/vector_add.cl: Kernel file for vector_add written in OpenCL.

    • -o bin/vector_add.aocx: Synthesized output (configuration for the FPGA).

    Expected output

    Code Block
    TBD

    Note that the build of the hardware design creates another report, similar to the one discussed in the previous step. In contrast to the previous report, the new report contains the actual resource utilization of the design. More details on the analysis of the actual image can be found in Intel’s documentation.

    6. Execute the hardware design on an FPGA.

    After the hardware synthesis (and host code compilation), we can allocate a suitably configured and equipped FPGA node for execution.

    Code Block
    srun --partition=fpga -A <your_project_acronym> --constraint=20.4.0_hpc -t 2:00:00 --pty bash

    Background information:

    • -A <your_project_acronym>: Specifies your project ID to charge compute time.

    • --constraint=20.4.0_max: Specifies the correct version of the FPGA drivers (see BSP).

    • --partition=fpga: Allocates one node with FPGAs. Two FPGAs are attached to each such node.

    • -t 2:00:00: Allocates the node for 2 hours.

    • --pty bash: Opens an interactive shell on the allocated node.

    To run the design, we load the proper modules and execute the host binary on the allocated FPGA node:

    Code Block
    module load intelFPGA_pro
    module load bittware_520n
    module load toolchain/gompi
    
    ./bin/host

    Expected output:

    Code Block
    ./bin/host
    Initializing OpenCL
    Platform: Intel(R) FPGA SDK for OpenCL(TM)
    Using 2 device(s)
      p520_max_sg280l : BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie0)
      p520_max_sg280l : BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie1)
    Using AOCX: vector_add.aocx
    MMD INFO : Disabling SmartVID (fix) polling
    MMD INFO : Enabling SmartVID (fix) polling
    MMD INFO : Disabling SmartVID (fix) polling
    MMD INFO : Enabling SmartVID (fix) polling
    Launching for device 0 (500000 elements)
    Launching for device 1 (500000 elements)
    
    Time: 3.600 ms
    Kernel time (device 0): 1.291 ms
    Kernel time (device 1): 1.303 ms
    
    Verification: PASS

    Congratulations. You have executed a real program on an FPGA.

    How to proceed

    Now that you have successfully compiled and run the example code on our FPGAs, you can proceed in various directions:

    • look into the source code of the vector_add example.

    • try one of the other examples mentioned above. Start with an example that is as close as possible to the actual problem that you try to accelerate using FPGAs.

    • visit our main FPGA documentation page to learn more about the used parameters, other options and troubleshooting common problems.

    • do not hesitate to send us an email if you face any problems, need support, or have any questions. Look for staff with Scientific Advisor FPGA Acceleration as their domain to contact the right person.