Xilinx Vitis Quick Start Guide


Run your first design on a Xilinx Alveo U280 card in 6 simple steps.

1. Get the latest examples from the Xilinx repository.

The latest Vitis Accel Examples are available on GitHub. The master branch of the repository is always under development for the next release and might be incompatible with the Vitis version installed on our systems. Therefore, we directly check out the version-specific branch that matches the Vitis version used in this guide.

git clone --branch 2023.1 https://github.com/Xilinx/Vitis_Accel_Examples.git
cd Vitis_Accel_Examples/cpp_kernels/loop_pipeline

In their example repository for acceleration with the Vitis tools, Xilinx groups the examples by kernel development strategy:

  • C++ based (subdirectory cpp_kernels, recommended start for new users)

  • OpenCL kernel language based (subdirectory ocl_kernels)

  • RTL based (subdirectory rtl_kernels, recommended for experts only)

and by host interface:

  • based on the Xilinx xrt API (subdirectory host_xrt)

  • based on the OpenCL host API (subdirectory host)

  • based on Python (subdirectory host_py)

In this quick start, we pick an example with C++ based kernels whose test uses the OpenCL host API, but feel free to explore the other options.

2. Set up the local software environment on Noctua 2.

module reset
module load fpga
module load xilinx/xrt/2.15

With module reset, previously loaded modules are cleaned up. The first module loaded, fpga, is a gateway module to the actual module loaded by the third command. Without a version number, the latest xilinx/xrt module is loaded; here we request version 2.15 explicitly. Under the hood, it loads further modules with matching versions of Vitis and the U280 shell. Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the U280 as the target card. Observe, for example:

echo $DEVICE
xilinx_u280_gen3x16_xdma_1_202211_1
echo $PLATFORM_REPO_PATHS
/opt/software/FPGA/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1
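
Before building, it can be useful to confirm that the modules actually exported the variables the Makefile relies on. The following is a minimal sketch in POSIX shell; the function name check_fpga_env is hypothetical, and the list only covers the two variables shown above.

```shell
# Hypothetical helper: report whether the variables shown above are set.
check_fpga_env() {
    missing=0
    for var in DEVICE PLATFORM_REPO_PATHS; do
        # Look up the value of the variable whose name is stored in $var.
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "missing: $var"
            missing=1
        else
            echo "ok: $var=$val"
        fi
    done
    return "$missing"
}

# Returns non-zero if at least one variable is unset or empty.
check_fpga_env || echo "Load the fpga and xilinx/xrt modules first."
```

If a variable is reported as missing, repeating the three module commands above in the current shell is the first thing to try.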

If you have a project that was only validated with an older version of xrt or Vitis (the example repository also contains branches and tags for older tool versions), you can explicitly load the module for an older version of xrt, e.g. xilinx/xrt/2.8 (the oldest supported version).

Supported xrt versions and connected Vitis and shell versions:

3. Build and test the example in emulation.

make run TARGET=sw_emu PLATFORM=$PLATFORM

Under the hood, the Makefile performs four main steps here:

  • Creating the host binary

    • g++ -o loop_pipeline ...

  • Creating the FPGA emulation binary with four substeps

    • Building an object file for a non-pipelined loop

      • v++ -t sw_emu ... 'src/vector_addition_BAD.cpp'

    • Building an object file for a pipelined loop

      • v++ -t sw_emu ... 'src/vector_addition.cpp'

    • Linking the two object files

      • v++ -t sw_emu ... -l ... _x.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo _x.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo

    • Packaging the result to emulate execution of U280 on the CPU

      • v++ -p ...

  • Preparing an emulation environment

    • emconfigutil --platform xilinx_u280_gen3x16_xdma_1_202211_1 ...

  • Running the emulation

    • XCL_EMULATION_MODE=sw_emu ./loop_pipeline ./build_dir.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
  • Expected output

################################################################################
### Running the emulation
################################################################################
g++ -o loop_pipeline ...
...
XCL_EMULATION_MODE=sw_emu ./loop_pipeline ./build_dir.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
Found Platform
Platform Name: Xilinx
INFO: Reading ./build_dir.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
Loading: './build_dir.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin'
Trying to program device[0]: xilinx_u280_gen3x16_xdma_1_202211_1
Device[0]: program successful!
|-------------------------+-------------------------|
| Kernel                  |    Wall-Clock Time (ns) |
|-------------------------+-------------------------|
| vadd: simple            |                    2731 |
| vadd: pipelined         |                    7634 |
|-------------------------+-------------------------|
| Speedup                 |                0.357742 |
|-------------------------+-------------------------|
Note: Wall Clock Time is meaningful for real hardware execution only, not for emulation.
Please refer to profile summary for kernel execution time for hardware emulation.
TEST PASSED.

This command builds and runs the loop_pipeline example in two variants:

  • vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'

  • vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'

Note: the expected output of the example contains performance figures measured in the host code. As the output itself points out, during emulation these measurements give no indication of the performance to expect from hardware execution. In this example, the pipelined code, which is faster on the FPGA, is slower in emulation.

4. Create and inspect reports as an indicator of expected HW performance.
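
The Speedup row in the table is the wall-clock time of the simple kernel divided by that of the pipelined kernel, so a value below 1 means the pipelined variant was slower. As a small sanity check, the ratio can be recomputed from the two figures; the helper name speedup is hypothetical.

```shell
# Hypothetical helper: ratio of two wall-clock times.
speedup() {
    # $1: time of the simple kernel in ns, $2: time of the pipelined kernel in ns
    awk -v simple="$1" -v pipelined="$2" 'BEGIN { printf "%.6f\n", simple / pipelined }'
}

speedup 2731 7634   # emulation figures from the output above, prints 0.357742
```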

The Makefile in the example doesn't contain an explicit target for reports. Reports are generated during the high-level synthesis (HLS) step, when the C++ code is translated to a hardware description. Analyzing these reports regularly, before any actual hardware build, is crucial for an efficient development process.

We demonstrate this step by manually picking the intermediate targets from the Makefile to generate the reports for the non-pipelined and pipelined variants of the loop_pipeline example:

  • vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'

make _x.hw.$PLATFORM/vadd.xo TARGET=hw PLATFORM=$PLATFORM
mkdir -p ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1
################################################################################
### Display hardware target.
################################################################################
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... -o'_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo' 'src/vector_addition_BAD.cpp'
...
INFO: [v++ 60-585] Compiling for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-242] Creating kernel: 'vadd'
...
################################################################################
### Output messages during HLS compilation.
################################################################################
===>The following messages were generated while performing high-level synthesis for kernel: vadd ...:
INFO: [v++ 204-61] Pipelining loop 'vadd_loop'.
WARNING: [v++ 200-885] The II Violation in module 'vadd_Pipeline_vadd_loop' (loop 'vadd_loop'): ...
Resolution: For help on HLS 200-885 see www.xilinx.com/cgi-bin/docs/rdoc?v=2023.1;t=hls+guidance;d=200-885.html
################################################################################
### II = 2 indicates that the kernel is not optimally pipelined.
################################################################################
INFO: [v++ 200-1470] Pipelining result : Target II = NA, Final II = 2, Depth = 143, loop 'vadd_loop'
INFO: [v++ 200-790] **** Loop Constraint Status: All loop constraints were NOT satisfied.
INFO: [v++ 200-789] **** Estimated Fmax: 411.00 MHz
################################################################################
### Next level of details can be found in the system estimate reports.
################################################################################
INFO: [v++ 60-594] Finished kernel compilation
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd/system_estimate_vadd.xtxt
INFO: [v++ 60-586] Created _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
    vitis_analyzer ... loop_pipeline/_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo.compile_summary

The reports reveal that the first kernel (generated from src/vector_addition_BAD.cpp), with an II = 2, is not optimally pipelined.
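
To see why the initiation interval (II) matters, a commonly used first-order model estimates the cycles of a pipelined loop as Depth + II × (N − 1), where N is the trip count. The sketch below applies it to the depth reported above; the function name estimate_cycles and the trip count of 1024 are assumptions for illustration, and the figures in the tool reports may include a small overhead on top of this model.

```shell
# Hypothetical helper: first-order cycle estimate for a pipelined loop.
estimate_cycles() {
    # $1: initiation interval (II), $2: pipeline depth, $3: trip count
    echo $(( $2 + $1 * ($3 - 1) ))
}

estimate_cycles 2 143 1024   # II = 2 with the reported depth, prints 2189
estimate_cycles 1 143 1024   # same loop if II = 1 were reached, prints 1166
```

For long loops the depth becomes negligible, so halving the II roughly halves the cycle count.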

Now the report for the second, pipelined kernel is generated:

  • vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'

make _x.hw.$PLATFORM/vadd_pipelined.xo TARGET=hw PLATFORM=$PLATFORM
mkdir -p ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1
################################################################################
### Display hardware target.
################################################################################
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... -o'_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo' 'src/vector_addition.cpp'
...
INFO: [v++ 60-585] Compiling for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-242] Creating kernel: 'vadd_pipelined'
################################################################################
### Output messages during HLS compilation.
################################################################################
===>The following messages were generated while performing high-level synthesis for kernel: vadd_pipelined ...:
INFO: [v++ 204-61] Pipelining loop 'read_a'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_a'
INFO: [v++ 204-61] Pipelining loop 'read_b'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_b'
INFO: [v++ 204-61] Pipelining loop 'write_c'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 71, loop 'write_c'
################################################################################
### II = 1 indicates that the kernel is perfectly pipelined.
################################################################################
INFO: [v++ 200-790] **** Loop Constraint Status: All loop constraints were satisfied.
INFO: [v++ 200-789] **** Estimated Fmax: 411.00 MHz
################################################################################
### Next level of details can be found in the system estimate reports.
################################################################################
INFO: [v++ 60-594] Finished kernel compilation
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/system_estimate_vadd_pipelined.xtxt
INFO: [v++ 60-586] Created _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
    vitis_analyzer ... /_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo.compile_summary

The second kernel contains three perfectly pipelined loops. However, will it actually be faster when the three loops are executed one after another?
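
When a build prints many messages, the pipelining results can be pulled out of the output with a small filter. The sketch below assumes the message wording shown in the logs above ("Target II = ..., Final II = ..., Depth = ..., loop '...'"); the function name summarize_ii is hypothetical, and the pattern may need adjusting for other tool versions.

```shell
# Hypothetical helper: print one summary line per "Pipelining result" message on stdin.
summarize_ii() {
    sed -n "s/.*Target II = \([A-Za-z0-9]*\), Final II = \([0-9]*\), Depth = \([0-9]*\), loop '\([^']*\)'.*/\4: target=\1 final=\2 depth=\3/p"
}

# Example input copied from the logs above.
summarize_ii <<'EOF'
INFO: [v++ 204-61] Pipelining loop 'vadd_loop'.
INFO: [v++ 200-1470] Pipelining result : Target II = NA, Final II = 2, Depth = 143, loop 'vadd_loop'
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_a'
EOF
# prints:
# vadd_loop: target=NA final=2 depth=143
# read_a: target=1 final=1 depth=73
```

The same filter can be fed from a pipe, for example by piping the output of the make commands above into summarize_ii.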

The pipelining properties that are output to console during HLS compilation are just the tip of the iceberg of generated reports. The next level of detail can be found in the system estimate reports at

  • _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd/system_estimate_vadd.xtxt for the non-pipelined loop and

  • _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/system_estimate_vadd_pipelined.xtxt for the pipelined loop.

They indicate that the first, non-pipelined kernel is expected to consume fewer resources:

Area Information
Compute Unit  Kernel Name  Module Name              FF    LUT   DSP  BRAM  URAM
------------  -----------  -----------------------  ----  ----  ---  ----  ----
vadd_1        vadd         vadd_Pipeline_vadd_loop  626   1092  0    0     0
vadd_1        vadd         vadd                     1755  2391  0    2     0
-------------------------------------------------------------------------------

Compared to the second, pipelined kernel:

Area Information
Compute Unit      Kernel Name     Module Name                      FF    LUT   DSP  BRAM  URAM
----------------  --------------  -------------------------------  ----  ----  ---  ----  ----
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_read_a   1349  190   0    0     0
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_read_b   1363  229   0    0     0
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_write_c  858   606   0    0     0
vadd_pipelined_1  vadd_pipelined  vadd_pipelined                   5971  4154  0    30    0
-------------------------------------------------------------------------------

More interestingly, however, based on the expected trip counts that the developer annotated as auxiliary information for the compiler via #pragma HLS LOOP_TRIPCOUNT (see 'src/vector_addition.cpp'), the system estimates also give a first idea that the execution of one insufficiently pipelined loop from the first kernel may take longer than the combined execution time of the three well-pipelined loops:

Latency Information (for first, non-pipelined kernel)
Compute Unit  Kernel Name  Module Name              Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
------------  -----------  -----------------------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
vadd_1        vadd         vadd_Pipeline_vadd_loop  2190            2190           2190          2190            7.299 us         7.299 us        7.299 us
vadd_1        vadd         vadd                     undef           undef          undef         undef           undef            undef           undef

Latency Information (for second, pipelined kernel)
Compute Unit      Kernel Name     Module Name                      Start Interval  Best (cycles)  Avg (cycles)  Worst (cycles)  Best (absolute)  Avg (absolute)  Worst (absolute)
----------------  --------------  -------------------------------  --------------  -------------  ------------  --------------  ---------------  --------------  ----------------
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_read_a   201             201            201           201             0.670 us         0.670 us        0.670 us
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_read_b   201             201            201           201             0.670 us         0.670 us        0.670 us
vadd_pipelined_1  vadd_pipelined  vadd_pipelined_Pipeline_write_c  199             199            199           199             0.663 us         0.663 us        0.663 us
vadd_pipelined_1  vadd_pipelined  vadd_pipelined                   undef           undef          undef         undef           undef            undef           undef
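
Adding up the three pipelined loops and comparing the sum against the single loop of the first kernel makes the estimate concrete. The sketch below only uses the cycle counts from the two tables above; the function name compare_cycles is hypothetical. Note that the excerpts do not show which trip count each figure is based on, so the ratio is only meaningful if the figures of both kernels cover the same number of elements. It is worth checking the LOOP_TRIPCOUNT annotations in both source files before drawing conclusions.

```shell
# Hypothetical helper: compare one loop against the sum of three loops.
compare_cycles() {
    # $1: cycles of the single loop, $2 $3 $4: cycles of the three pipelined loops
    awk -v s="$1" -v a="$2" -v b="$3" -v c="$4" \
        'BEGIN { sum = a + b + c; printf "combined=%d ratio=%.2f\n", sum, s / sum }'
}

compare_cycles 2190 201 201 199   # prints combined=601 ratio=3.64
```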

5. Build the hardware design (bitstream).

This hardware build step (so-called hardware synthesis) can take a lot of time and compute resources, so we create a batch script to submit the job to the Slurm workload manager.

#!/bin/sh
# synthesis_script.sh
#SBATCH -t 24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH -q fpgasynthesis
#SBATCH -A <your_project_acronym>
#SBATCH -p normal

module reset
module load fpga
module load xilinx/xrt/2.15

make build TARGET=hw PLATFORM=$PLATFORM

Then, we submit synthesis_script.sh to the Slurm workload manager:

sbatch ./synthesis_script.sh

  • Options for sbatch can be passed on the command line or, as done here, encoded inside synthesis_script.sh in lines starting with #SBATCH.

  • Synthesis is performed via the normal partition, as no FPGA hardware is required for this step and thus no full node is blocked by a single job.

  • For small examples, 8 CPU cores and 45G main memory are sufficient.

  • You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.

  • Expected output

################################################################################
### Display hardware target.
################################################################################
mkdir -p ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... -I'src' -o'_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo' 'src/vector_addition.cpp'
Option Map File Used: '/cm/shared/opt/Xilinx/Vitis/2023.1/data/vitis/vpp/optMap.xml'
****** v++ v2023.1 (64-bit)
INFO: [v++ 60-1306] Additional information associated with this v++ compile can be found at:
    Reports: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined
    Log files: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/vadd_pipelined
Running Dispatch Server on port: 37957
INFO: [v++ 60-1548] Creating build summary session with primary output .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo.compile_summary, at Tue Feb 15 10:02:21 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Tue Feb 15 10:02:21 2022
INFO: [v++ 60-1315] Creating rulecheck session with output '.../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/v++_compile_vadd_pipelined_guidance.html', at Tue Feb 15 10:02:23 2022
INFO: [v++ 60-895] Target platform: /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1/hw/xilinx_u280_gen3x16_xdma_1_202211_1.xsa'
INFO: [v++ 74-78] Compiler Version string: 2023.1
INFO: [v++ 60-1302] Platform 'xilinx_u280_gen3x16_xdma_1_202211_1.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-585] Compiling for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-242] Creating kernel: 'vadd_pipelined'
################################################################################
### Output messages during HLS compilation.
################################################################################
===>The following messages were generated while performing high-level synthesis for kernel: vadd_pipelined Log file: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined/vadd_pipelined/vitis_hls.log :
INFO: [v++ 204-61] Pipelining loop 'read_a'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_a'
INFO: [v++ 204-61] Pipelining loop 'read_b'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 73, loop 'read_b'
INFO: [v++ 204-61] Pipelining loop 'write_c'.
INFO: [v++ 200-1470] Pipelining result : Target II = 1, Final II = 1, Depth = 71, loop 'write_c'
INFO: [v++ 200-790] **** Loop Constraint Status: All loop constraints were satisfied.
INFO: [v++ 200-789] **** Estimated Fmax: 411.00 MHz
INFO: [v++ 60-594] Finished kernel compilation
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/vadd_pipelined/system_estimate_vadd_pipelined.xtxt
INFO: [v++ 60-586] Created _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
    vitis_analyzer .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo.compile_summary
INFO: [v++ 60-791] Total elapsed time: 0h 1m 7s
INFO: [v++ 60-1653] Closing dispatch client.
################################################################################
### Actual hardware build.
################################################################################
mkdir -p ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1
v++ -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 ... _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
Option Map File Used: '/cm/shared/opt/Xilinx/Vitis/2023.1/data/vitis/vpp/optMap.xml'
****** v++ v2023.1 (64-bit)
  **** SW Build 3363252 on 2021-10-14-04:41:01
    ** Copyright 1986-2020 Xilinx, Inc. All Rights Reserved.
INFO: [v++ 60-1306] Additional information associated with this v++ link can be found at:
    Reports: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link
    Log files: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/link
Running Dispatch Server on port: 39405
INFO: [v++ 60-1548] Creating build summary session with primary output .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin.link_summary, at Tue Feb 15 10:03:31 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Tue Feb 15 10:03:31 2022
INFO: [v++ 60-1315] Creating rulecheck session with output '.../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/v++_link_vector_addition.link_guidance.html', at Tue Feb 15 10:03:33 2022
INFO: [v++ 60-895] Target platform: /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/hw/xilinx_u280_gen3x16_xdma_1_202211_1.xsa'
INFO: [v++ 74-78] Compiler Version string: 2023.1
INFO: [v++ 60-1302] Platform 'xilinx_u280_gen3x16_xdma_1_202211_1.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-629] Linking for hardware target
INFO: [v++ 60-423] Target device: xilinx_u280_gen3x16_xdma_1_202211_1
INFO: [v++ 60-1332] Run 'run_link' status: Not started
INFO: [v++ 60-1443] [10:03:40] Run run_link: Step system_link: Started
INFO: [v++ 60-1453] Command Line: system_link --xo .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo --xo .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo -keep --xpfm /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm --target hw --output_dir .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/link/int --temp_dir .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/link/sys_link
INFO: [v++ 60-1454] Run Directory: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/link/run_link
...
INFO: [v++ 60-244] Generating system estimate report...
INFO: [v++ 60-1092] Generated system estimate report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/system_estimate_vector_addition.link.xtxt
INFO: [v++ 60-586] Created .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.ltx
INFO: [v++ 60-586] Created ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin
INFO: [v++ 60-1307] Run completed. Additional information can be found in:
    Guidance: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/v++_link_vector_addition.link_guidance.html
    Timing Report: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/reports/link/imp/impl_1_xilinx_u280_gen3x16_xdma_1_202211_1_bb_locked_timing_summary_routed.rpt
    Vivado Log: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/link/vivado.log
    Steps Log File: .../_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/logs/link/link.steps.log
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
    vitis_analyzer .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin.link_summary
INFO: [v++ 60-791] Total elapsed time: 1h 6m 50s
INFO: [v++ 60-1653] Closing dispatch client.
v++ -p ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.link.xclbin -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 --package.out_dir ./package.hw -o ./build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
Option Map File Used: '/cm/shared/opt/Xilinx/Vitis/2023.1/data/vitis/vpp/optMap.xml'
****** v++ v2023.1 (64-bit)
  **** SW Build 3363252 on 2021-10-14-04:41:01
    ** Copyright 1986-2020 Xilinx, Inc. All Rights Reserved.
INFO: [v++ 60-1306] Additional information associated with this v++ package can be found at:
    Reports: .../_x/reports/package
    Log files: .../_x/logs/package
Running Dispatch Server on port: 37691
INFO: [v++ 60-1548] Creating build summary session with primary output .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin.package_summary, at Tue Feb 15 11:10:23 2022
INFO: [v++ 60-1316] Initiating connection to rulecheck server, at Tue Feb 15 11:10:23 2022
INFO: [v++ 60-1315] Creating rulecheck session with output '.../_x/reports/package/v++_package_vector_addition_guidance.html', at Tue Feb 15 11:10:25 2022
INFO: [v++ 60-895] Target platform: /cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/cm/shared/opt/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1_3246211/hw/xilinx_u280_gen3x16_xdma_1_202211_1.xsa'
INFO: [v++ 74-78] Compiler Version string: 2023.1
INFO: [v++ 60-1302] Platform 'xilinx_u280_gen3x16_xdma_1_202211_1.xpfm' has been explicitly enabled for this release.
INFO: [v++ 60-2256] Packaging for hardware
INFO: [v++ 60-2460] Successfully copied a temporary xclbin to the output xclbin: ..././build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
    vitis_analyzer .../build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin.package_summary

To speed up the process and avoid spending resources on unnecessary synthesis, we have pre-synthesized the design. The following steps copy the pre-synthesized design for hardware execution.

In order to still use the Slurm workload manager, we use a modified batch script, copy_pre-synthesed_design_script.sh, and submit it.

#!/bin/sh
# copy_pre-synthesed_design_script.sh

# Instead of starting the actual synthesis with `make build TARGET=hw`,
# we extract the result from an archive.
tar -xvf /opt/software/FPGA/Xilinx/Vitis/2023.1/samples/loop_pipeline.tar.gz

# We fix some timestamps as a workaround. Otherwise `make` will not work.
touch _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo
touch _x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo
touch build_dir.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vector_addition.xclbin
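
The timestamp workaround is presumably needed because make rebuilds a target whenever one of its prerequisites has a newer modification time, and files unpacked from an archive keep their archived timestamps, which can be older than those of freshly cloned sources. The sketch below illustrates that comparison on throwaway files; the helper is_up_to_date and the file names in the temporary directory are hypothetical and not part of the example repository.

```shell
# Hypothetical helper: mimic the freshness check make applies to one prerequisite.
is_up_to_date() {
    # $1: target, $2: prerequisite
    # Up to date means: the target exists and the prerequisite is not newer.
    [ -e "$1" ] && ! [ "$2" -nt "$1" ]
}

demo=$(mktemp -d)
touch -t 202301010000 "$demo/vadd.xo"      # unpacked artifact with an old timestamp
touch -t 202306010000 "$demo/kernel.cpp"   # source file that is newer than the artifact

if is_up_to_date "$demo/vadd.xo" "$demo/kernel.cpp"; then
    echo "before touch: up to date"
else
    echo "before touch: stale, make would rebuild"
fi

touch "$demo/vadd.xo"                      # same workaround as in the script above

if is_up_to_date "$demo/vadd.xo" "$demo/kernel.cpp"; then
    echo "after touch: up to date"
else
    echo "after touch: stale, make would rebuild"
fi

rm -rf "$demo"
```

Without the touch commands, make could decide that the unpacked files are stale and start the long synthesis again.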

Then, we submit copy_pre-synthesed_design_script.sh to the Slurm workload manager:

sbatch --partition=normal -A <your_project_acronym> -t 00:10:00 ./copy_pre-synthesed_design_script.sh
  • We submit into --partition=normal.

  • With -t 00:10:00, we allocate a small amount of time to this file copy job.

  • You can check the progress of your job via squeue and after the job completes, check the complete job output in slurm-<jobid>.out.

6. Execute the hardware design on an FPGA.

After the hardware synthesis, we can allocate a suitably configured FPGA node for execution:

srun --partition=fpga -A <your_project_acronym> --constraint=xilinx_u280_xrt2.15 -t 2:00:00 --pty bash

To run the design, we load the proper modules and use the corresponding make command on the allocated FPGA node:

module reset
module load fpga
module load xilinx/xrt/2.15
make run TARGET=hw PLATFORM=$PLATFORM
...
Device[0]: program successful!
|-------------------------+-------------------------|
| Kernel                  |    Wall-Clock Time (ns) |
|-------------------------+-------------------------|
| vadd: simple            |                   25082 |
| vadd: pipelined         |                   16158 |
|-------------------------+-------------------------|
| Speedup                 |                  1.5523 |
|-------------------------+-------------------------|
Note: Wall Clock Time is meaningful for real hardware execution only, not for emulation.
Please refer to profile summary for kernel execution time for hardware emulation.
TEST PASSED.

With regard to performance, we make two observations.

  • On the one hand, the first kernel (simple) is now indeed slower than the second (pipelined) kernel. However, the difference does not correspond to the estimate from the system estimate reports. Effects of global memory access come into play that are not captured by the simple cycle model of the kernel logic.

  • On the other hand, due to the non-optimized memory interface, execution on the FPGA is actually slower than emulation on the host. Note that in the example application, the hardware design is also executed for many more iterations than the emulation, which makes the perceived difference appear even larger than the actual one.

You can proceed with examples cpp_kernels/wide_mem_rw, performance/kernel_global_bandwidth and performance/hbm_bandwidth to see more optimized memory performance.

 
