This guide walks you through the six steps required to use the Intel OpenCL FPGA toolkit on Noctua 2.

...

Afterwards, copy the vector_add example into your getting_started_with_fpgas workspace:

Code Block

cp -r /opt/software/FPGA/IntelFPGA/opencl_sdk/21.4.0/hld/examples_aoc/common .
cp -r /opt/software/FPGA/IntelFPGA/opencl_sdk/21.4.0/hld/examples_aoc/vector_add .

...

2. Set up the local software environment on Noctua 2.

Code Block
module reset
module load fpga
module load fpga/intel/opencl_sdk/21.4.0
module load fpga/bittware/520n/20.4.0_hpc

Expand

title	Details

With module reset, previously loaded modules are cleaned up. The first module loaded, fpga, is a gateway module to the actual modules loaded in lines 3-4. If no version number is provided, the latest versions are loaded. To use a specific version, append the version, e.g. fpga/intel/opencl_sdk/21.4.0. All available versions can be queried with module avail fpga/intel/opencl_sdk. With the given commands, the following modules are loaded:

intel/opencl_sdk: Loads the compilation infrastructure for Intel OpenCL FPGA code
bittware/520n: Loads the drivers and board support package (BSP) for the Intel Stratix 10 card
toolchain/gompi: Loads the compilation infrastructure for the host code (most current C++ compilers will work)

Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the Stratix 10 as the target card. Observe for example:

Code Block
echo $FPGA_BOARD_NAME
p520_hpc_sg280l
echo $AOCL_BOARD_PACKAGE_ROOT
/opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10_hpc_default
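Before building, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch; check_fpga_env is a hypothetical helper name and not part of the Intel toolkit or the module system.

```shell
# Sketch: verify that the FPGA modules exported the variables the Makefile relies on.
# check_fpga_env is a hypothetical helper, not part of the toolkit.
check_fpga_env() {
    missing=0
    for var in FPGA_BOARD_NAME AOCL_BOARD_PACKAGE_ROOT; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var (did you load the fpga modules?)"
            missing=1
        else
            echo "ok: $var=$value"
        fi
    done
    return "$missing"
}

check_fpga_env || echo "environment incomplete"
```

The function returns a non-zero status when a variable is missing, so it can also guard a build step, e.g. check_fpga_env && make.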

If you have a project that was only validated with an older BSP, you can explicitly load the module for that older BSP version, e.g. bittware_520n/19.4.0_hpc.

The table below shows the full mapping of valid SDK to BSP versions for the Intel OpenCL design flow. Make sure to match the allocated constraint for real hardware execution.

Include Page

	Intel FPGA SDK for OpenCL and Bittware BSP version combinations

...

Expand

title	Details

Behind the scenes the Makefile triggers the following command, putting together the correct OpenCL headers and libraries, to produce an executable bin/host:

g++ -O2 -fstack-protector -D_FORTIFY_SOURCE=2 -Wformat -Wformat-security -fPIE -fPIC -fPIC -I../common/inc -I/opt/software/FPGA/IntelFPGA/opencl_sdk/21.4.0/hld/host/include host/src/main.cpp ../common/src/AOCLUtils/opencl.cpp ../common/src/AOCLUtils/options.cpp -L/opt/software/FPGA/IntelFPGA/opencl_sdk/21.4.0/hld/host/linux64/lib -z noexecstack -Wl,-z,relro,-z,now -Wl,-Bsymbolic -pie -lOpenCL -lrt -lpthread -o bin/host

Further behind the scenes, the Makefile determines some of these compile parameters by invoking the command line tool aocl according to the actual environment as set up with modules. You can look at these parameters by invoking these commands yourself and use them in your own build process:

Code Block
aocl compile-config
aocl ldlibs
aocl ldflags
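As a rough illustration of using these values in your own build process, the sketch below assembles a g++ command line from the three aocl queries. host_build_cmd is a hypothetical helper, and the optimization flag and argument layout are assumptions rather than the exact command the example Makefile generates.

```shell
# Sketch: assemble a host build command from the flags reported by aocl.
# host_build_cmd is a hypothetical helper; usage: host_build_cmd <output> <sources...>
host_build_cmd() {
    out=$1
    shift
    cflags=$(aocl compile-config) || return 1
    ldflags=$(aocl ldflags) || return 1
    ldlibs=$(aocl ldlibs) || return 1
    # Print the command instead of running it, so it can be inspected first.
    echo "g++ -O2 $cflags $* $ldflags $ldlibs -o $out"
}

if command -v aocl >/dev/null 2>&1; then
    host_build_cmd bin/host host/src/main.cpp
else
    echo "aocl not found: load the FPGA modules first"
fi
```

Printing the command rather than executing it keeps the sketch safe to try; once the output looks right, it can be run directly or copied into a Makefile rule.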

...

Code Block
aoc -rtl -v -board=$FPGA_BOARD_NAME -board-package=$AOCL_BOARD_PACKAGE_ROOT device/vector_add.cl -o vector_add_report

Expand

title	Details

Background:

-rtl: Tells the compiler to stop after report generation.
-v: Shows more details during the generation
-board=$FPGA_BOARD_NAME: Specifies the target FPGA board (Bittware 520N with Intel Stratix 10 GX 2800).
-board-package=$AOCL_BOARD_PACKAGE_ROOT: Specifies the BSP in the correct version. Normally this argument is not required, as the compiler uses the environment variable AOCL_BOARD_PACKAGE_ROOT. It is only needed if you intentionally want to generate a report on an FPGA node allocated with a different constraint.
device/vector_add.cl: Kernel file for vector_add written in OpenCL.
-o vector_add_report: Output directory.
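The optional nature of -board-package can be sketched as a small wrapper that only adds the argument when a different BSP is requested. report_cmd and the variable BSP_OVERRIDE are hypothetical names introduced for this illustration; the wrapper prints the aoc command instead of running it.

```shell
# Sketch: build the report-only aoc command, adding -board-package only on request.
# report_cmd and BSP_OVERRIDE are hypothetical names, not part of the toolkit.
report_cmd() {
    kernel=$1
    outdir=$2
    if [ -z "${FPGA_BOARD_NAME:-}" ]; then
        echo "FPGA_BOARD_NAME is not set: load the FPGA modules first" >&2
        return 1
    fi
    cmd="aoc -rtl -v -board=$FPGA_BOARD_NAME"
    # Without an override, aoc falls back to AOCL_BOARD_PACKAGE_ROOT from the environment.
    if [ -n "${BSP_OVERRIDE:-}" ]; then
        cmd="$cmd -board-package=$BSP_OVERRIDE"
    fi
    echo "$cmd $kernel -o $outdir"
}

# Fallback board name for a dry run outside the cluster (value taken from the text above).
FPGA_BOARD_NAME=${FPGA_BOARD_NAME:-p520_hpc_sg280l}
report_cmd device/vector_add.cl vector_add_report
```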

To inspect the report, you may want to copy it to your local file system or mount your working directory. For more information, refer to [Noctua2-FileSystems]. For example, you can compress the report on Noctua 2:

Code Block
tar -caf vector_add_report.tar.gz vector_add_report/reports

Then copy and decompress it from your local command line (e.g. Linux, MacOS, or Windows Subsystem for Linux):

Code Block

rsync -azv -e 'ssh -J <your-username>@fe.noctua2.pc2.uni-paderborn.de' <your-username>@n2login2:/scratch/<DIRECTORY_ASSIGNED_TO_YOUR_PROJECT>/getting_started_with_fpgas/vector_add/vector_add_report.tar.gz .

tar -xzf vector_add_report.tar.gz

Open and inspect vector_add_report/reports/report.html in your browser. The main blocks of the report are:

Throughput Analysis -> Loop Analysis: displays information about all loops and their optimization status (is the loop pipelined? what is its initiation interval (II)?).
Area Analysis (of System): details the area utilization with architectural information about the generated hardware.
Views -> System Viewer: gives an overview of your kernels, their connections to each other and to external resources like memory.
Views -> Kernel Memory Viewer: displays the data movement and synchronization in your code.
Views -> Schedule Viewer: shows the scheduling of the generated instructions with corresponding latencies.
Bottleneck Viewer: identifies bottlenecks that reduce the performance of the design, e.g. by lowering the maximum clock frequency (F_max) or increasing the initiation interval (II).

For this example, the throughput analysis contains little information, since the example is very simple and ND-Range kernels such as the one used here yield fewer details in the report than Single Work Item kernels. The area analysis shows that the kernel system uses at most 1% of the available resources, so much more complex or parallel kernels could fit on the FPGA. The system viewer shows two 32-bit burst-coalesced load operations and one 32-bit burst-coalesced store operation. Refer to Intel's documentation on the Intel FPGA SDK for OpenCL (in particular the Programming and Best Practices guides) to learn more about the properties and optimization goals in the report.

...

Code Block

#!/bin/sh

# synthesis_script.sh script

#SBATCH -t 24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH -q fpgasynthesis
#SBATCH -A <your_project_acronym>
#SBATCH -p normal

module reset
module load fpga
module load intel/opencl_sdk
module load bittware/520n

aoc -board=$FPGA_BOARD_NAME -board-package=$AOCL_BOARD_PACKAGE_ROOT device/vector_add.cl -o bin/vector_add.aocx

Then, we submit synthesis_script.sh to the Slurm workload manager:

Code Block
sbatch ./synthesis_script.sh

...

Expand

title	Details and expected output with annotations

With --mem=32G, we allocate a small amount of main memory to this synthesis job, corresponding to the very small example we build here. For larger designs, typically at least 64G will be needed.

...


With --cpus-per-task=8, we use more cores to parallelize the synthesis.
Using -q fpgasynthesis gives the FPGA bitstream-synthesis jobs a higher priority, see https://uni-paderborn.atlassian.net/wiki/spaces/PC2DOK/pages/13205994/Quality-of-Service+QoS+and+Job+Priorities#fpgasynthesis-Priority.
You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.
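Checking on the job can also be scripted. The sketch below submits the batch script and polls squeue until the job has left the queue; wait_for_job is a hypothetical helper, and the 60 second default polling interval is an arbitrary choice.

```shell
# Sketch: block until a Slurm job has left the queue, then point to its output file.
# wait_for_job is a hypothetical helper; usage: wait_for_job <jobid> [interval_seconds]
wait_for_job() {
    jobid=$1
    interval=${2:-60}
    # squeue -h suppresses the header, so empty output means the job is no longer queued or running.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $jobid finished, see slurm-$jobid.out"
}

if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print only the job ID (optionally followed by ;cluster).
    jobid=$(sbatch --parsable ./synthesis_script.sh | cut -d';' -f1)
    wait_for_job "$jobid"
else
    echo "sbatch not found: run this on a Noctua 2 login node"
fi
```

Since a synthesis can take hours, running such a loop inside a terminal multiplexer or simply checking squeue manually from time to time may be more practical.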

Under the hood, the aoc command uses the following parameters

-board=$FPGA_BOARD_NAME: Specifies the target FPGA board (Bittware 520N with Intel Stratix 10 GX 2800).
-board-package=$AOCL_BOARD_PACKAGE_ROOT: Specifies the BSP in the correct version. Normally this argument is not required as the compiler uses the environment variable AOCL_BOARD_PACKAGE_ROOT.
device/vector_add.cl: Kernel file for vector_add written in OpenCL.
-o bin/vector_add.aocx: Synthesized output (configuration for the FPGA).

Expected output

Code Block

cpu-bind=MASK - n2cn0962, task  0  0 [765805]: mask |BBBBBBBB|--------||--------|--------||--------|--------||--------|--------||||--------|--------||--------|--------||--------|--------||--------|--------|  set
Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
  ==============================================================
  Intel recommends migrating existing designs to Intel(R) oneAPI
  to get access to the latest FPGA high-level design features, 
  optimizations, and development utilities.
  The FPGA SDK for OpenCL(TM) tool will be DEPRECATED after the
  22.4 Release.
  Visit the Intel oneAPI product page for migration advice, or
  go to the Intel(C) High-Level Design community forum for any
  questions or requests.
  ==============================================================

AOCL_TMP_DIR directory was specified at /scratch/FPGA.
Ensure Linux and Windows compiles do not share the same directory as files may be incompatible.
aoc: Running OpenCL parser....
aoc: OpenCL parser completed 
aoc: Linking Object files....
aoc: Optimizing and doing static analysis of code...
Compiler Warning: device/vector_add.cl:23: declaring global arguments 'x' and 'y' with no 'restrict' may lead to low performance for kernel 'vector_add'
aoc: First stage compilation completed successfully.
aoc: Compiling for FPGA. This process may take several hours to complete.  Prior to performing this compile, be sure to check the reports to ensure the design will meet your performance targets.  If the reports indicate performance targets are not being met, code edits may be required.  Please refer to the Intel FPGA SDK for OpenCL Best Practices Guide for information on performance tuning applications for FPGAs.

Note that the build of the hardware design creates another report similar to the one discussed in the previous step. In contrast to the previous report, the new report contains the actual resource utilization of the design.

To speed up the process and avoid spending resources on unnecessary synthesis, we have pre-synthesized the design. Expand the box below to copy the pre-synthesized design for hardware execution.

Expand

title	Use pre-synthesized design

In order to still use the Slurm workload manager, we use a modified batch script, copy_pre-synthesed_design_script.sh, and submit it.

Code Block
#!/bin/sh
# copy_pre-synthesed_design_script.sh
# Instead of starting the actual synthesis, we use pre-synthesized results.
# cp (rather than mv) keeps the pre-synthesized file, so the script can be submitted again.
cp bin/vector_add_fpga.aocx bin/vector_add.aocx

Then, we submit copy_pre-synthesed_design_script.sh to the Slurm workload manager:

Code Block
sbatch --partition=normal -A <your_project_acronym> -t 00:10:00 ./copy_pre-synthesed_design_script.sh

We submit into --partition=normal.
With -t 00:10:00, we allocate a small amount of time to this file copy job.
You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.

Please note that the compiled kernel has the same name for emulation and FPGA execution (that is, vector_add.aocx). If you accidentally overwrite the pre-synthesized design, you can submit the script again.

...

Code Block
srun --partition=fpga -A <your_project_acronym> --constraint=bittware_520n_20.4.0_hpc -t 2:00:00 --pty bash

Expand

title	Details

Background information:

-A <your_project_acronym>: Specify your project ID to charge compute time.
--constraint=bittware_520n_20.4.0_hpc: Specifies the correct version of the FPGA drivers (see BSP).
--partition=fpga: Allocate a Noctua 2 node with FPGAs. Two FPGAs are attached to one node.
-t 2:00:00: Allocate the node for 2 hours.
--pty bash: Open an interactive shell on the allocated node.

To run the design, we load the proper modules and use the corresponding command on the allocated FPGA node:

Code Block
module reset
module load fpga
module load intel/opencl_sdk
module load bittware/520n
./bin/host

Expand

title	Expected output

Code Block

./bin/host 
Initializing OpenCL
Platform: Intel(R) FPGA SDK for OpenCL(TM)
Using 2 device(s)
  p520_hpc_sg280l : BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie0)
  p520_hpc_sg280l : BittWare Stratix 10 OpenCL platform (aclbitt_s10_pcie1)
Using AOCX: vector_add.aocx
MMD INFO : Disabling SmartVID (fix) polling
MMD INFO : Enabling SmartVID (fix) polling
MMD INFO : Disabling SmartVID (fix) polling
MMD INFO : Enabling SmartVID (fix) polling
Launching for device 0 (500000 elements)
Launching for device 1 (500000 elements)

Time: 3.600 ms
Kernel time (device 0): 1.291 ms
Kernel time (device 1): 1.303 ms

Verification: PASS
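When the host program runs inside a script rather than an interactive session, the verification line can be checked automatically. This is a minimal sketch; check_run and the log file name host.log are hypothetical, and it only looks for the "Verification: PASS" line shown in the expected output above.

```shell
# Sketch: confirm that a saved host log contains the verification line.
# check_run is a hypothetical helper; usage: check_run <logfile>
check_run() {
    log=$1
    if grep -q '^Verification: PASS' "$log"; then
        echo "run verified"
    else
        echo "verification line not found in $log"
        return 1
    fi
}

# Usage on the allocated FPGA node (commented out so the sketch also runs elsewhere):
# ./bin/host | tee host.log
# check_run host.log
```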

...
