Intel oneAPI Quick Start Guide

This guide will walk you through the six steps required to use the Intel oneAPI FPGA toolkit on Noctua 2.

1. Get the examples.

The latest Intel oneAPI-samples are available on GitHub. The master branch of the repository is always under development for the next release and might be incompatible with the latest version installed on our systems. You can check out a version-specific branch that matches the oneAPI version you are going to use.

You can copy the files from our file system to speed up the process with

cp -r /opt/software/FPGA/IntelFPGA/oneapi/24.1.0/oneAPI-samples .
cd oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile

or clone the repository and checkout the correct version with

git clone https://github.com/oneapi-src/oneAPI-samples.git
cd oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile
git checkout release/2024.1

Intel grouped the FPGA examples (directory oneAPI-samples/DirectProgramming/C++SYCL_FPGA) into

  • ReferenceDesigns, which demonstrates the implementation of highly optimized algorithms and applications on an FPGA

    • anr: Adaptive Noise Reduction

    • crr: CRR Binomial Tree Model for Option Pricing

    • db: Database Query Acceleration

    • gzip: GZIP Compression

    • merge_sort: Merge Sort

    • mvdr_beamforming: MVDR Beamforming

    • qrd: QR Decomposition of Matrices

    • qri: QR-based inversion of Matrices

  • Tutorials, which itself consists of

    • DesignPatterns (double buffering, I/O streaming, loop optimizations, …)

    • Features (loop unrolling, pipes, usage of pragmas, …)

    • GettingStarted (our guide is based on this tutorial)

    • Tools (collect data for profiling)

In this quick start we pick the GettingStarted example from the tutorials, but feel free to explore the other options.

2. Set up the local software environment on Noctua 2.

module reset
module load fpga devel compiler
module load intel/oneapi/24.1.0
module load bittware/520n/20.4.0_hpc
module load CMake
module load GCC

With module reset, previously loaded modules are cleaned up. The first three modules loaded (fpga, devel, compiler) are the gateway modules to the actual modules loaded in lines 3-6. If no version number is provided, the latest version will be loaded. To use a specific version, you can append it, e.g. intel/oneapi/24.1.0. With the given commands, the following modules are loaded:

  • intel/oneapi: software stack for oneAPI development

  • bittware/520n: drivers and board support package (BSP) for the Intel Stratix 10 card

  • CMake: helps with compilation and build

  • GCC: the GNU Compiler Collection
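
If you are unsure which versions of a module are available on Noctua 2, the standard Lmod commands can be used to check (shown here as a convenience; they are not part of the original tutorial):

module avail intel/oneapi
module list

module avail lists the installed versions matching a name, and module list shows which modules are currently loaded.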

Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the Stratix 10 as the target card. Observe, for example:

echo $FPGA_BOARD_NAME
p520_hpc_sg280l
echo $AOCL_BOARD_PACKAGE_ROOT
/opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default


mkdir build
cd build
cmake .. -DFPGA_DEVICE=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME

The build directory (for cmake) is created and configured with the correct target board (using the environment variables $AOCL_BOARD_PACKAGE_ROOT and $FPGA_BOARD_NAME populated via modules in the previous step).

Expected output

-- The CXX compiler identification is IntelLLVM 2024.1.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/software/FPGA/IntelFPGA/oneapi/24.1.0/compiler/2024.1/bin/icpx - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring the design with the following target: /opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default:p520_hpc_sg280l
-- Configuring done (1.7s)
-- Generating done (0.2s)
-- Build files have been written to: .../oneapi-24.1.0/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile/build
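
If you want to double-check which target ended up in the build configuration, you can inspect the CMake cache in the build directory (a quick sanity check, not part of the original tutorial):

grep FPGA_DEVICE CMakeCache.txt

The cached FPGA_DEVICE entry should contain the BSP path and the board name p520_hpc_sg280l shown above.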

3. Build and test the example in emulation.

make fpga_emu

Builds the emulation binary called fast_recompile.fpga_emu.

Under the hood, make performs two main steps here (simplified from the CMake-generated calls):

  • Creating the object files from source files

    • mkdir fast_recompile.fpga_emu.dir
    • icpx -fsycl -Wall -fintelfpga -DFPGA_EMULATOR -o fast_recompile.fpga_emu.dir/kernel.cpp.o -c ../src/kernel.cpp
    • icpx -fsycl -Wall -fintelfpga -DFPGA_EMULATOR -I ../../../../include/ -o fast_recompile.fpga_emu.dir/host.cpp.o -c ../src/host.cpp
  • the flag -fintelfpga instructs the compiler to compile for FPGA.

  • the flag -DFPGA_EMULATOR is used in the host code as a pre-processor macro to select either the emulator or the real FPGA device for execution (see the sketch after this list).

  • Linking the FPGA emulation binary

    • icpx -fsycl -fintelfpga fast_recompile.fpga_emu.dir/host.cpp.o fast_recompile.fpga_emu.dir/kernel.cpp.o -o fast_recompile.fpga_emu
      • produces the emulation binary called fast_recompile.fpga_emu.
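
To illustrate how the FPGA_EMULATOR macro comes into play, the following is a minimal host-code sketch of the typical device-selection pattern used by the oneAPI FPGA samples (not the sample's actual host.cpp, just an illustration):

// device_selection_sketch.cpp - compile-time choice between emulator and FPGA hardware
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <iostream>

int main() {
#if defined(FPGA_EMULATOR)
  // chosen when compiling with -DFPGA_EMULATOR (emulation build)
  auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#else
  // chosen for the hardware build (no FPGA_EMULATOR macro defined)
  auto selector = sycl::ext::intel::fpga_selector_v;
#endif
  sycl::queue q(selector);
  std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}

When compiled with -DFPGA_EMULATOR, the queue is created on the FPGA emulation device; without the macro, the real FPGA board is selected.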

./fast_recompile.fpga_emu
Running on device: Intel(R) FPGA Emulation Device
PASSED: results are correct

This executes the emulation binary. Note that the FPGA Emulation Device is selected.

Note: the emulation gives no indication at all of the performance to be expected from hardware execution on a real FPGA.

4. Create and inspect reports as an indicator of expected HW performance.

make report

Under the hood, make executes compilation and linking steps, which could be manually replicated as

mkdir fast_recompile_report.a.dir
icpx -fsycl -fintelfpga -o fast_recompile_report.a.dir/kernel.cpp.o -c ../src/kernel.cpp
icpx -fsycl -fintelfpga -I ../../../../include/ -o fast_recompile_report.a.dir/host.cpp.o -c ../src/host.cpp
icpx -fsycl -fintelfpga -Xshardware -fsycl-link=early -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME fast_recompile_report.a.dir/host.cpp.o fast_recompile_report.a.dir/kernel.cpp.o -o fast_recompile_report.a

In addition to the -fintelfpga flag that we already used for the emulation build, we now pass -Xshardware to target FPGA hardware instead of the emulator, and make use of

  • the -fsycl-link=early flag, which instructs the compiler to generate the optimization reports and to stop afterwards

  • the -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME flag, which tells the compiler the target FPGA card.

The generated report is an HTML file (called report.html) located in fast_recompile_report.prj/reports/ for our example.

  • the general location of the report is always <file_name>_report.prj/reports/report.html

With older tool versions, reports from this repository generated using cmake tend to have a problem with escape characters in one of the files. You can fix this problem with one of these commands.

sed -i 's/\\\\\\3/\\\\\\\\3/g; s/\\\\\\\\\"/\\\\\\\\\\\"/g' fast_recompile_report.prj/reports/resources/report_data.js
sed -i 's/\\\\\\3/\\\\\\\\3/g; s/\\\\\\\\\"/\\\\\\\\\\\"/g' fast_recompile.report.prj/reports/resources/report_data.js

Make sure the file was found correctly (no error message) or verify which path actually exists:

ls fast_recompile*report.prj/reports/resources/report_data.js

In order to inspect the report, you may want to copy it to your local file system or mount your working directory; for more information, refer to Data Management. For example, you can compress the report on Noctua 2:

tar -caf fast_recompile_report.tar.gz fast_recompile_report.prj/reports

Then copy and decompress it from your local command line (e.g. Linux, MacOS, or Windows Subsystem for Linux):

rsync -azv -e 'ssh -J <your-username>@fe.noctua2.pc2.uni-paderborn.de' <your-username>@n2login2:/scratch/<DIRECTORY_ASSIGNED_TO_YOUR_PROJECT>/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/Tutorials/GettingStarted/fast_recompile/build/fast_recompile_report.tar.gz .
tar -xzf fast_recompile_report.tar.gz

Open and inspect fast_recompile_report.prj/reports/report.html in your browser. The whole analysis contains little information, since the example is very simple. The main blocks of the report are

  • Throughput Analysis -> Loop Analysis: displays information about all loops and their optimization status (is it pipelined? what is the initiation interval (II) of the loop? …).

  • Area Estimates: details about the area utilization, with architectural insight into the generated hardware.

  • Views -> System Viewer: gives an overall overview of your kernels and their connections to each other and to external resources like memory.

  • Views -> Kernel Memory Viewer: displays the data movement and synchronization in your code.

  • Views -> Schedule Viewer: shows the scheduling of the generated instructions with corresponding latencies.

The Area Analysis for our example shows that the kernel system uses at most 1% of the available resources, so much more complex kernels could fit on the FPGA. The System Viewer shows two 32-bit Burst-coalesced load operations and one 32-bit Burst-coalesced store operation. Refer to Intel's documentation to learn more about the properties of the report and the optimization goals.

5. Build the hardware design (bitstream).

This hardware build step (the so-called hardware synthesis) can take a lot of time (hours!) and compute resources, so we create a batch script to submit the job to the slurm workload manager. Make sure to put your actual project acronym in place of the placeholder.

#!/bin/sh
# synthesis_script.sh script
#SBATCH -t 24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH -q fpgasynthesis
#SBATCH -A <your_project_acronym>
#SBATCH -p normal
module reset
module load fpga devel compiler
module load intel/oneapi/24.1.0
module load bittware/520n/20.4.0_hpc
module load CMake
module load GCC
make fpga

Then, we submit the synthesis_script.sh to the slurm workload manager:

sbatch synthesis_script.sh
  • With --mem=32G, we allocate a small amount of main memory to this synthesis job, corresponding to the very small example we build here. For larger designs, typically at least 64G will be needed.

  • With --cpus-per-task=8, we use more cores to parallelize the synthesis.

  • Using -q fpgasynthesis gives the FPGA bitstream-synthesis jobs a higher priority, see Quality-of-Service (QoS) and Job Priorities.

  • You can check the progress of your job via squeue and, after the job completes, check the complete job output in slurm-<jobid>.out (see the example commands below).
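
For example, assuming <jobid> is the id reported by sbatch, the job can be monitored with standard slurm and shell commands:

squeue -u $USER
tail -f slurm-<jobid>.out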

Under the hood, make performs a step that we already know from the emulation and report generation:

icpx -fsycl -fintelfpga -Xshardware -fsycl-link=image -Xsboard=$AOCL_BOARD_PACKAGE_ROOT:$FPGA_BOARD_NAME fast_recompile.fpga.dir/host.cpp.o fast_recompile.fpga.dir/kernel.cpp.o -o fast_recompile.fpga
  • the only difference is the -fsycl-link=image flag. It instructs the compiler to perform the full hardware synthesis and not to stop after the optimization report.

  • the binary fast_recompile.fpga is generated. It will handle the initialization and execution of the code on an FPGA (see next step).

Expected output

[ 33%] Generating dev_image.a
aoc: Compiling for FPGA. This process may take several hours to complete. Prior to performing this compile, be sure to check the reports to ensure the design will meet your performance targets. If the reports indicate performance targets are not being met, code edits may be required. Please refer to the oneAPI FPGA Optimization Guide for information on performance tuning applications for FPGAs.
[ 66%] Generating host.o
[100%] Generating fast_recompile.fpga
[100%] Built target fpga

Note that the build of the hardware design will create another report, similar to the report that we discussed in the previous step. In contrast to the previous report, the new report contains the actual resource utilization of the design. More details on the analysis of the actual image can be found in Intel's documentation.

To speed up the process and avoid spending resources on unnecessary synthesis, we have pre-synthesized the design. The following steps show how to copy the pre-synthesized design for hardware execution.

In order to still use the slurm workload manager, we use a modified batch script copy_pre-synthesed_design_script.sh and submit it.

#!/bin/sh
# copy_pre-synthesed_design_script.sh
# Instead of starting the actual synthesis with `make fpga`,
# we extract the pre-synthesized result from an archive.
tar -xvf /opt/software/FPGA/IntelFPGA/oneapi/24.1.0/pre-synthesized/fast_recompile_-_20.4.0_hpc_-_24.1.0.tar.gz

Then, we submit the copy_pre-synthesed_design_script.sh to the slurm workload manager:

sbatch --partition=normal -A <your_project_acronym> -t 00:10:00 ./copy_pre-synthesed_design_script.sh
  • Note that in this example, we pass the relevant job parameters directly to sbatch instead of encoding them in the jobscript as done in the synthesis_script.sh example. For many purposes, both variants are equally suitable.

  • We submit into --partition=normal.

  • With -t 00:10:00, we allocate a small amount of time to this file copy job.

  • You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.

6. Execute the hardware design on an FPGA.

After the hardware synthesis, we can allocate a suitably configured and equipped FPGA node for execution.

srun --partition=fpga -A <your_project_acronym> --constraint=bittware_520n_20.4.0_hpc -t 2:00:00 --pty bash

To run the design, we load the proper modules and execute the generated binary on the allocated FPGA node:

module reset
module load fpga devel compiler
module load intel/oneapi/24.1.0
module load bittware/520n/20.4.0_hpc
module load CMake
module load GCC
./fast_recompile.fpga
Running on device: p520_hpc_sg280l : BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie0)
MMD INFO : Disabling SmartVID (fix) polling
MMD INFO : Enabling SmartVID (fix) polling
PASSED: results are correct

How to proceed

For more information on using the tools, refer to Intel's oneAPI FPGA documentation, e.g. the Intel oneAPI FPGA Optimization Guide.
