Run your first design on a Xilinx Alveo U280 card in 6 simple steps.
1. Get the latest examples from the Xilinx repository.
The latest Vitis Accel Examples are available on GitHub. The master branch of the repository is always under development for the next release and might be incompatible with the latest version installed on our systems. Therefore, we directly check out a version-specific branch that matches the Vitis version you are going to use in this guide.
git clone --branch 2023.1 https://github.com/Xilinx/Vitis_Accel_Examples.git
cd Vitis_Accel_Examples/cpp_kernels/loop_pipeline
In their example repository for acceleration with the Vitis tools, Xilinx grouped the examples by strategies for kernel development
C++ based (subdirectory cpp_kernels, recommended start for new users)
OpenCL kernel language based (subdirectory ocl_kernels)
RTL based (subdirectory rtl_kernels, recommended for experts only)
and by features for the host interface
based on the Xilinx xrt API (subdirectory host_xrt)
based on the OpenCL host API (subdirectory host)
based on Python (subdirectory host_py)
In this quick start we pick an example of C++ based kernels and the test uses the OpenCL host API, but feel free to explore the other options.
2. Set up the local software environment on Noctua 2.
With module reset, previously loaded modules are cleaned up. The first module loaded, fpga, is a gateway module to the actual xilinx/xrt module loaded after it. If no version number is provided, the latest version, xilinx/xrt/2.15, is loaded. Under the hood it loads further modules for matching versions of Vitis and the U280 shell. Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the U280 as target card. Observe for example:
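The module commands described above can be sketched as follows. This is a minimal sketch, not the original listing: the three module commands are the ones used later in this guide on the FPGA node, and the assumption that PLATFORM is among the exported variables is based on the make invocations in the following steps. The helper show_fpga_env is a hypothetical name introduced here for illustration.

```shell
#!/bin/bash
# Sketch: set up the environment and show what the modules exported.
# Guarded so that nothing fails on a machine without the `module` command.
if command -v module >/dev/null 2>&1; then
    module reset                  # clean up previously loaded modules
    module load fpga              # gateway module
    module load xilinx/xrt/2.15   # actual xrt module (pulls in Vitis and shell)
fi

# show_fpga_env is a hypothetical helper, not part of any module.
# Assumption: the modules export PLATFORM (used later as PLATFORM=$PLATFORM).
show_fpga_env() {
    echo "PLATFORM=${PLATFORM:-<not set>}"
}

show_fpga_env
```

On a correctly configured login node, the printed value should be the platform name that appears in the build directories later in this guide.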
If you have a project that was only validated with an older version of xrt or Vitis (also the example repository contains branches and tags for older tool versions), you can explicitly load the module for an older version of xrt, e.g. xilinx/xrt/2.8 (oldest supported version).
Supported xrt versions and connected Vitis and shell versions:
3. Build and test the example in emulation.
make run TARGET=sw_emu PLATFORM=$PLATFORM
Under the hood, the makefile performs four main steps here
Creating the host binary
g++ -o loop_pipeline ...
Creating the FPGA emulation binary with four substeps
#################################################################################################################################
### Running the emulation
#################################################################################################################################
g++ -o loop_pipeline ...
...
XCL_EMULATION_MODE=sw_emu ./loop_pipeline ./build_dir.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
Found Platform
Platform Name: Xilinx
INFO: Reading ./build_dir.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
Loading: './build_dir.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin'
Trying to program device[0]: xilinx_u280_gen3x16_xdma_1_202211_1
Device[0]: program successful!
|-------------------------+-------------------------|
| Kernel | Wall-Clock Time (ns) |
|-------------------------+-------------------------|
| vadd: simple | 2731 |
| vadd: pipelined | 7634 |
|-------------------------+-------------------------|
| Speedup | 0.357742 |
|-------------------------+-------------------------|
Note: Wall Clock Time is meaningful for real hardware execution only, not for emulation.
Please refer to profile summary for kernel execution time for hardware emulation.
TEST PASSED.
This command will build and run the loop_pipeline example in two variants:
vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'
vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'
Note: the expected output of the example contains performance figures measured in the host code. As the output itself points out, during emulation these measurements give no indication of the performance to be expected from hardware execution. In this example, the pipelined code, which is much faster on the FPGA, is slower in emulation.
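The Speedup row appears to be the wall-clock time of the simple kernel divided by that of the pipelined kernel; the numbers in the table above are consistent with this reading. A small sketch that recomputes it (the function name speedup is introduced here for illustration only):

```shell
#!/bin/bash
# speedup SIMPLE_NS PIPELINED_NS
# Prints simple/pipelined; values below 1 mean the pipelined kernel was slower.
speedup() {
    awk -v simple="$1" -v pipelined="$2" \
        'BEGIN { printf "%.6f\n", simple / pipelined }'
}

# Wall-clock times (ns) from the emulation run above.
speedup 2731 7634   # prints 0.357742
```

A value below 1, as here, simply reflects that the emulation model executes the restructured code more slowly; it says nothing about the hardware.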
4. Create and inspect reports as an indicator of expected HW performance.
The Makefile in the example doesn’t contain an explicit target for reports. Reports are generated during the high-level synthesis step, when the C++ code is translated to a hardware description. For any efficient development process, it is crucially important to analyze these reports regularly prior to actual hardware builds.
We demonstrate this step by manually picking the intermediate targets from the Makefile to generate the reports for the non-pipelined and pipelined variants of the loop_pipeline example:
vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'
make _x.hw.$PLATFORM/vadd.xo TARGET=hw PLATFORM=$PLATFORM
mkdir -p ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1
#################################################################################################################################
### Display hardware target.
#################################################################################################################################
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... -o'_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo' 'src/vector_addition_BAD.cpp'
...
INFO: [v++ 60-585] Compiling for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-242] Creating kernel: 'vadd'
...
#################################################################################################################################
### Output messages during HLS compilation.
#################################################################################################################################
===>The following messages were generated while performing high-level synthesis for kernel: vadd ...:
INFO: [v++ 204-61] Pipelining loop 'vadd_loop'.
WARNING: [v++ 200-885] The II Violation in module 'vadd_Pipeline_vadd_loop' (loop 'vadd_loop'): ...
Resolution: For help on HLS 200-885 see www.xilinx.com/cgi-bin/docs/rdoc?v=2023.1;t=hls+guidance;d=200-885.html
#################################################################################################################################
### II = 2 indicates that the kernel is not optimally pipelined.
#################################################################################################################################
INFO: [v++ 200-1470] Pipelining result : Target II = NA, Final II = 2, Depth = 143, loop 'vadd_loop'
INFO: [v++ 200-790] **** Loop Constraint Status: All loop constraints were NOT satisfied.
INFO: [v++ 200-789] **** Estimated Fmax: 411.00 MHz
#################################################################################################################################
### Next level of details can be found in the system estimate reports.
#################################################################################################################################
INFO: [v++ 60-594] Finished kernel compilation
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd/system_estimate_vadd.xtxt
INFO: [v++ 60-586] Created _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
vitis_analyzer ... loop_pipeline/_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo.compile_summary
The reports reveal that the first kernel (generated from src/vector_addition_BAD.cpp), with an II = 2, is not optimally pipelined.
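The initiation interval (II) is the number of clock cycles between the starts of consecutive loop iterations, so for a long loop II = 2 roughly doubles the cycle count compared with II = 1. A common first-order estimate for a pipelined loop is (trip count - 1) * II + depth. The sketch below applies it with a hypothetical trip count of 1000, which is not taken from the example; only the depth of 143 comes from the log above, and it is kept fixed for both cases purely for comparison. The latency figures in the actual reports may differ by a few cycles.

```shell
#!/bin/bash
# pipelined_cycles TRIP_COUNT II DEPTH
# First-order latency estimate of a pipelined loop (hypothetical helper).
pipelined_cycles() {
    echo $(( ($1 - 1) * $2 + $3 ))
}

pipelined_cycles 1000 2 143   # prints 2141
pipelined_cycles 1000 1 143   # prints 1142
```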
Now the report for the second, pipelined kernel is generated:
vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'
make _x.hw.$PLATFORM/vadd_pipelined.xo TARGET=hw PLATFORM=$PLATFORM
mkdir -p ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1
#################################################################################################################################
### Display hardware target.
#################################################################################################################################
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... -o'_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo' 'src/vector_addition.cpp'
...
INFO: [v++ 60-585] Compiling for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-242] Creating kernel: 'vadd_pipelined'
#################################################################################################################################
### Output messages during HLS compilation.
#################################################################################################################################
===>The following messages were generated while performing high-level synthesis for kernel: vadd_pipelined ...:
INFO: [v++ 204-61] Pipelining loop 'read_a'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_a'
INFO: [v++ 204-61] Pipelining loop 'read_b'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_b'
INFO: [v++ 204-61] Pipelining loop 'write_c'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 71, loop 'write_c'
#################################################################################################################################
### II = 1 indicates that the kernel is perfectly pipelined.
#################################################################################################################################
INFO: [v++ 200-790] **** Loop Constraint Status: All loop constraints were satisfied.
INFO: [v++ 200-789] **** Estimated Fmax: 411.00 MHz
#################################################################################################################################
### Next level of details can be found in the system estimate reports.
#################################################################################################################################
INFO: [v++ 60-594] Finished kernel compilation
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/system_estimate_vadd_pipelined.xtxt
INFO: [v++ 60-586] Created _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
vitis_analyzer ... /_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo.compile_summary
The second kernel contains three perfectly pipelined loops. However, will it actually be faster when the three loops are executed one after another?
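For larger designs it can help to filter the pipelining results out of the build output. The sketch below does this with grep and sed on a few sample lines copied from the logs above; in practice you would pipe in the saved console output of make instead. The function name summarize_ii is hypothetical.

```shell
#!/bin/bash
# summarize_ii: read v++ console output on stdin and print
# "<loop> target=<Target II> final=<Final II>" for each pipelined loop.
summarize_ii() {
    grep 'Pipelining result' |
        sed -E "s/.*Target II = ([^,]+), Final II = ([^,]+), Depth = [0-9]+, loop '([^']+)'.*/\3 target=\1 final=\2/"
}

# Sample lines taken from the compilation output shown above.
summarize_ii <<'EOF'
INFO: [v++ 204-61] Pipelining loop 'vadd_loop'.
INFO: [v++ 200-1470] Pipelining result : Target II = NA, Final II = 2, Depth = 143, loop 'vadd_loop'
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_a'
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 71, loop 'write_c'
EOF
```

Any loop whose final II is larger than 1, or larger than its target, is a candidate for a closer look in the detailed reports.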
The pipelining properties that are output to console during HLS compilation are just the tip of the iceberg of generated reports. The next level of detail can be found in the system estimate reports at
_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd/system_estimate_vadd.xtxt for the non-pipelined loop and
_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/system_estimate_vadd_pipelined.xtxt for the pipelined loop.
They indicate that the first, non-pipelined kernel is expected to consume fewer resources:
Area Information
Compute Unit  Kernel Name  Module Name              FF    LUT   DSP  BRAM  URAM
------------  -----------  -----------------------  ----  ----  ---  ----  ----
vadd_1        vadd         vadd_Pipeline_vadd_loop  626   1092  0    0     0
vadd_1        vadd         vadd                     1755  2391  0    2     0
-------------------------------------------------------------------------------
More interestingly, however, the system estimates also use the expected trip counts that the developer annotated as auxiliary information for the compiler via #pragma HLS LOOP_TRIPCOUNT (see 'src/vector_addition.cpp'). Based on these, they give a first idea that executing the one insufficiently pipelined loop of the first kernel may take longer than the combined execution time of the three well-pipelined loops:
Latency Information (for first, non-pipelined kernel)
Compute Unit  Kernel Name  Module Name              Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
------------  -----------  -----------------------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
vadd_1        vadd         vadd_Pipeline_vadd_loop  2190            2190           2190          2190            7.299 us         7.299 us        7.299 us
vadd_1        vadd         vadd                     undef           undef          undef         undef           undef            undef           undef
Latency Information (for second, pipelined kernel)
Compute Unit      Kernel Name     Module Name                      Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
----------------  --------------  -------------------------------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_read_a   201             201            201           201             0.670 us         0.670 us        0.670 us
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_read_b   201             201            201           201             0.670 us         0.670 us        0.670 us
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_write_c  199             199            199           199             0.663 us         0.663 us        0.663 us
vadd_pipelined_1  vadd_pipelined  vadd_pipelined                   undef           undef          undef         undef           undef            undef           undef
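To make the comparison concrete, the per-module cycle counts from the two latency tables can be added up. This is a rough sketch: it only sums the figures as reported and does not account for how often each module is invoked per kernel call or for global memory effects. The clock period is derived from the first table (7.299 us for 2190 cycles), not from a documented setting.

```shell
#!/bin/bash
# Cycle counts copied from the latency tables above.
simple_cycles=2190
pipelined_total=$(( 201 + 201 + 199 ))   # read_a + read_b + write_c

# Convert to microseconds with the period implied by the first table
# and compute the ratio of the two estimates.
summary=$(awk -v s="$simple_cycles" -v p="$pipelined_total" 'BEGIN {
    period_us = 7.299 / 2190
    printf "pipelined: %.3f us, simple: %.3f us, ratio: %.2f", p * period_us, s * period_us, s / p
}')

echo "pipelined total: ${pipelined_total} cycles"   # prints 601 cycles
echo "$summary"   # pipelined: 2.003 us, simple: 7.299 us, ratio: 3.64
```

Under these assumptions the estimate favors the pipelined kernel by a factor of about 3.6, which is the figure to keep in mind when looking at the measured hardware results.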
5. Build the hardware design (bitstream)
This hardware build step (so-called hardware synthesis) can take lots of time and compute resources, so we create a batch script to submit the job to the slurm workload manager.
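A minimal sketch of what such a batch script could look like is shown below. It is not the original script: the resource requests follow the notes in this section (normal partition, 8 CPU cores, 45G memory), make build TARGET=hw is the command named in the pre-synthesized variant further down, and the time limit is an assumption that you should adapt (the link step in the expected output took a bit over one hour). Replace <your_project_acronym> with your own project. The snippet writes the script to disk so that it can be submitted afterwards.

```shell
#!/bin/bash
# Write a hypothetical synthesis_script.sh; adjust account and time limit.
cat > synthesis_script.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=normal
#SBATCH -A <your_project_acronym>
#SBATCH -t 04:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=45G

# Same environment as for emulation.
module reset
module load fpga
module load xilinx/xrt/2.15

# Full hardware build of the example.
make build TARGET=hw PLATFORM=$PLATFORM
EOF
chmod +x synthesis_script.sh
```

Because the heredoc delimiter is quoted, $PLATFORM is written literally into the script and only expanded when the job runs with the modules loaded.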
Then, we submit the synthesis_script.sh to the slurm workload manager:
sbatch ./synthesis_script.sh
The command line arguments passed to sbatch can also be encoded inside synthesis_script.sh in lines starting with #SBATCH
Synthesis is performed via the normal partition, as no FPGA hardware is required for this step and thus no full node is blocked by a single job.
For small examples, 8 CPU cores and 45G main memory are sufficient.
You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.
Expected output
#################################################################################################################################
### Display hardware target.
#################################################################################################################################
mkdir -p ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... -I'src' -o'_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo' 'src/vector_addition.cpp'
Option Map File Used: '/cm/shared/opt/Xilinx/Vitis/2023.1/data/vitis/vpp/optMap.xml'
****** v++ v2023.1 (64-bit)
INFO: [v++ 60-1306] Additional information associated with this v++ compile can be found at:
Reports: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined
Log files: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/vadd_pipelined
Running Dispatch Server on port: 37957
INFO: [v++ 60-1548] Creating build summary session with primary output .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo.compile_summary, at Tue Feb 15 10:02:21 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Tue Feb 15 10:02:21 2022
INFO: [v++ 60-1315] Creating rulecheck session with output '.../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/v++_compile_vadd_pipelined_guidance.html', at Tue Feb 15 10:02:23 2022
INFO: [v++ 60-895] Target platform: /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1/hw/xilinx_u280_gen3x16_xdma_1_202211_1.xsa'
INFO: [v++ 74-78] Compiler Version string: 2023.1
INFO: [v++ 60-1302] Platform 'xilinx_u280_gen3x16_xdma_1_202211_1.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-585] Compiling for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-242] Creating kernel: 'vadd_pipelined'
#################################################################################################################################
### Output messages during HLS compilation.
#################################################################################################################################
===>The following messages were generated while performing high-level synthesis for kernel: vadd_pipelined Log file: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined/vadd_pipelined/vitis_hls.log :
INFO: [v++ 204-61] Pipelining loop 'read_a'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_a'
INFO: [v++ 204-61] Pipelining loop 'read_b'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_b'
INFO: [v++ 204-61] Pipelining loop 'write_c'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 71, loop 'write_c'
INFO: [v++ 200-790] **** Loop Constraint Status: All loop constraints were satisfied.
INFO: [v++ 200-789] **** Estimated Fmax: 411.00 MHz
INFO: [v++ 60-594] Finished kernel compilation
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/system_estimate_vadd_pipelined.xtxt
INFO: [v++ 60-586] Created _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
vitis_analyzer .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo.compile_summary
INFO: [v++ 60-791] Total elapsed time: 0h 1m 7s
INFO: [v++ 60-1653] Closing dispatch client.
#################################################################################################################################
### Actual hardware build.
#################################################################################################################################
mkdir -p ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
Option Map File Used: '/cm/shared/opt/Xilinx/Vitis/2023.1/data/vitis/vpp/optMap.xml'
****** v++ v2023.1 (64-bit)
**** SW Build 3363252 on 2021-10-14-04:41:01
** Copyright 1986-2020 Xilinx, Inc. All Rights Reserved.
INFO: [v++ 60-1306] Additional information associated with this v++ link can be found at:
Reports: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link
Log files: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/link
Running Dispatch Server on port: 39405
INFO: [v++ 60-1548] Creating build summary session with primary output .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin.link_summary, at Tue Feb 15 10:03:31 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Tue Feb 15 10:03:31 2022
INFO: [v++ 60-1315] Creating rulecheck session with output '.../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/v++_link_vector_addition.link_guidance.html', at Tue Feb 15 10:03:33 2022
INFO: [v++ 60-895] Target platform: /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/hw/xilinx_u280_gen3x16_xdma_1_202211_1.xsa'
INFO: [v++ 74-78] Compiler Version string: 2023.1
INFO: [v++ 60-1302] Platform 'xilinx_u280_gen3x16_xdma_1_202211_1.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-629] Linking for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-1332] Run 'run_link' status: Not started
INFO: [v++ 60-1443] [10:03:40] Run run_link: Step system_link: Started
INFO: [v++ 60-1453] Command Line: system_link --xo .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo --xo .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo -keep --xpfm /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm --target hw --output_dir .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/link/int --temp_dir .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/link/sys_link
INFO: [v++ 60-1454] Run Directory: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/link/run_link
...
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/system_estimate_vector_addition.link.xtxt
INFO: [v++ 60-586] Created .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.ltx
INFO: [v++ 60-586] Created ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin
INFO: [v++ 60-1307] Run completed. Additional information can be found in:
Guidance: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/v++_link_vector_addition.link_guidance.html
Timing Report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/imp/impl_1_xilinx_u280_gen3x16_xdma_1_202211_1_bb_locked_timing_summary_routed.rpt
Vivado Log: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/link/vivado.log
Steps Log File: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/link/link.steps.log
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
vitis_analyzer .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin.link_summary
INFO: [v++ 60-791] Total elapsed time: 1h 6m 50s
INFO: [v++ 60-1653] Closing dispatch client.
v++ -p ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 --package.out_dir ./package.hw -o ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
Option Map File Used: '/cm/shared/opt/Xilinx/Vitis/2023.1/data/vitis/vpp/optMap.xml'
****** v++ v2023.1 (64-bit)
**** SW Build 3363252 on 2021-10-14-04:41:01
** Copyright 1986-2020 Xilinx, Inc. All Rights Reserved.
INFO: [v++ 60-1306] Additional information associated with this v++ package can be found at:
Reports: .../_x/reports/package
Log files: .../_x/logs/package
Running Dispatch Server on port: 37691
INFO: [v++ 60-1548] Creating build summary session with primary output .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin.package_summary, at Tue Feb 15 11:10:23 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Tue Feb 15 11:10:23 2022
INFO: [v++ 60-1315] Creating rulecheck session with output '.../_x/reports/package/v++_package_vector_addition_guidance.html', at Tue Feb 15 11:10:25 2022
INFO: [v++ 60-895] Target platform: /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/hw/xilinx_u280_gen3x16_xdma_1_202211_1.xsa'
INFO: [v++ 74-78] Compiler Version string: 2023.1
INFO: [v++ 60-1302] Platform 'xilinx_u280_gen3x16_xdma_1_202211_1.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-2256] Packaging for hardware
INFO: [v++ 60-2460] Successfully copied a temporary xclbin to the output xclbin: ..././build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
vitis_analyzer .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin.package_summary
To speed up the process and avoid spending resources on unnecessary synthesis, we have pre-synthesized the design. The steps below copy the pre-synthesized design for hardware execution.
In order to still use the slurm workload manager, we use a modified batch script copy_pre-synthesed_design_script.sh and submit it.
#!/bin/sh
# copy_pre-synthesed_design_script.sh
# Instead of starting the actual synthesis with `make build TARGET=hw`,
# we extract the result from an archive.
tar -xvf /opt/software/FPGA/Xilinx/Vitis/2023.1/samples/loop_pipeline.tar.gz
# We fix some timestamps as a workaround. Otherwise `make` will not work.
touch _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo
touch _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
touch build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
Then, we submit the copy_pre-synthesed_design_script.sh to the slurm workload manager:
sbatch --partition=normal -A <your_project_acronym> -t 00:10:00 ./copy_pre-synthesed_design_script.sh
We submit into --partition=normal.
With -t 00:10:00, we allocate a small amount of time to this file copy job.
You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.
6. Execute the hardware design on an FPGA.
After the hardware synthesis, we can allocate a suitably configured FPGA node for execution
srun --partition=fpga -A <your_project_acronym> --constraint=xilinx_u280_xrt2.15 -t 2:00:00 --pty bash
To run the design, we load the proper modules and use the corresponding make command on the allocated FPGA node
module reset
module load fpga
module load xilinx/xrt/2.15
make run TARGET=hw PLATFORM=$PLATFORM
...
Device[0]: program successful!
|-------------------------+-------------------------|
| Kernel | Wall-Clock Time (ns) |
|-------------------------+-------------------------|
| vadd: simple | 25082 |
| vadd: pipelined | 16158 |
|-------------------------+-------------------------|
| Speedup | 1.5523 |
|-------------------------+-------------------------|
Note: Wall Clock Time is meaningful for real hardware execution only, not for emulation.
Please refer to profile summary for kernel execution time for hardware emulation.
TEST PASSED.
With regard to performance, we make two observations.
First, the simple kernel is now indeed slower than the pipelined kernel. However, the difference does not correspond to the estimate from the system estimate reports: effects of global memory access come into play that are not captured by the simple cycle model of the kernel logic.
Second, due to the non-optimized memory interface, the FPGA execution is actually slower than emulation on the host. Note that in the example application the hardware design is also executed for many more iterations than the emulation, which makes the perceived difference appear even larger than the actual one.
You can proceed with examples cpp_kernels/wide_mem_rw, performance/kernel_global_bandwidth and performance/hbm_bandwidth to see more optimized memory performance.