Xilinx Vitis Quick Start Guide

Run your first design on a Xilinx Alveo U280 card in 6 simple steps.

1. Get the latest examples from the Xilinx repository.

The latest Vitis Accel Examples are available on GitHub. The master branch of the repository is always under development for the next release and might be incompatible with the latest version installed on our systems. Therefore, we directly check out a version-specific branch that matches the Vitis version you are going to use in this guide.

git clone --branch 2023.1 https://github.com/Xilinx/Vitis_Accel_Examples.git
cd Vitis_Accel_Examples/cpp_kernels/loop_pipeline

In their example repository for acceleration with the Vitis tools, Xilinx grouped the examples by strategies for kernel development

  • C++ based (subdirectory cpp_kernels, recommended start for new users)

  • OpenCL kernel language based (subdirectory ocl_kernels)

  • RTL based (subdirectory rtl_kernels, recommended for experts only)

and by features for the host interface

  • based on the Xilinx xrt API (subdirectory host_xrt)

  • based on the OpenCL host API (subdirectory host)

  • based on Python (subdirectory host_py)

In this quick start, we pick an example with C++ based kernels that uses the OpenCL host API, but feel free to explore the other options.

2. Set up the local software environment on Noctua 2.

module reset
module load fpga
module load xilinx/xrt/2.15

With module reset, previously loaded modules are cleaned up. The first module loaded, fpga, is a gateway module to the actual module loaded in the third line. If no version number is provided, the latest xilinx/xrt module (currently 2.15) will be loaded. Under the hood, it loads further modules with matching versions of Vitis and the U280 shell. Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the U280 as the target card. Observe, for example:

echo $DEVICE
xilinx_u280_gen3x16_xdma_1_202211_1
echo $PLATFORM_REPO_PATHS
/opt/software/FPGA/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1

If you have a project that was only validated with an older version of xrt or Vitis (the example repository also contains branches and tags for older tool versions), you can explicitly load the module for an older xrt version, e.g. xilinx/xrt/2.8 (the oldest supported version).
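
For example, switching to the oldest supported xrt version would look like this (a sketch, assuming the older module is still provided through the fpga gateway module):

module reset
module load fpga
module load xilinx/xrt/2.8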

Supported xrt versions and the corresponding Vitis and shell versions:

3. Build and test the example in emulation.
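
The example is built and run in software emulation via the Makefile. A minimal sketch, assuming the Makefile's standard run target; depending on the branch, the platform is passed as PLATFORM= or DEVICE=, so check the Makefile if the variable name differs:

make run TARGET=sw_emu PLATFORM=$DEVICE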

Under the hood, the makefile performs four main steps here

  • Creating the host binary

    • g++ -o loop_pipeline ...

  • Creating the FPGA emulation binary with four substeps

    • Building an object file for a non-pipelined loop

      • v++ -t sw_emu ... 'src/vector_addition_BAD.cpp'

    • Building an object file for a pipelined loop

      • v++ -t sw_emu ... 'src/vector_addition.cpp'

    • Linking the two object files

      • v++ -t sw_emu ... -l ... _x.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo _x.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo

    • Packaging the result to emulate execution of U280 on the CPU

      • v++ -p ...

  • Preparing an emulation environment

    • emconfigutil --platform xilinx_u280_gen3x16_xdma_1_202211_1 ...

  • Running the emulation

  • Expected output

This command will build and run the loop_pipeline example in two variants

  • vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'

  • vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'

Note: the expected output of the example contains performance figures measured in the host code. As stated in the output itself, during emulation these measurements are no indication of the performance to be expected from hardware execution. In this example, the pipelined code, which is much faster on the FPGA, is slower in emulation.

4. Create and inspect reports as an indicator of expected HW performance.

The Makefile in the example doesn’t contain an explicit target for reports. Reports are generated during the high-level synthesis step, when the C++ code is translated to a hardware description. Analyzing these reports regularly, prior to actual hardware builds, is crucially important for an efficient development process.

We demonstrate this step by manually picking the intermediate targets from the Makefile to generate the reports for the non-pipelined and pipelined variants of the loop_pipeline example:

  • vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'
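
A possible invocation (a sketch, assuming the intermediate .xo target naming from the linking step shown in step 3 and the PLATFORM= variable; adapt if your branch uses DEVICE=). Compiling the kernel for the hw target runs high-level synthesis and writes the reports into the corresponding _x.hw.* build directory:

make ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo TARGET=hw PLATFORM=$DEVICE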

The reports reveal that the first kernel (generated from src/vector_addition_BAD.cpp), with an initiation interval (II) of 2, is not optimally pipelined: a new loop iteration can only start every second clock cycle.

Now the report for the second, pipelined kernel is generated:

  • vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'
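
Analogously, assuming the corresponding intermediate target for the pipelined kernel:

make ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo TARGET=hw PLATFORM=$DEVICE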

The second kernel contains three perfectly pipelined loops (II = 1). However, will it actually be faster when the three loops are executed one after another?

5. Build the hardware design (bitstream).

This hardware build step (the so-called hardware synthesis) can take a lot of time and compute resources, so we create a batch script to submit the job to the Slurm workload manager.
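
A minimal sketch of such a batch script, here called synthesis_script.sh; the time limit and resource requests are assumptions and should be adapted to your project, and the build target and PLATFORM= variable follow the same Makefile conventions as above:

#!/bin/bash
#SBATCH -t 04:00:00          # hardware synthesis typically runs for hours
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
module reset
module load fpga
module load xilinx/xrt/2.15
make build TARGET=hw PLATFORM=$DEVICE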

Then, we submit the synthesis_script.sh to the Slurm workload manager:
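
Assuming the script above is stored in the current working directory:

sbatch synthesis_script.sh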

To speed up the process and avoid spending resources on unnecessary synthesis runs, we have pre-synthesized the design. Expand the box below to copy the pre-synthesized design for hardware execution.

6. Execute the hardware design on an FPGA.

After the hardware synthesis, we can allocate a suitably configured FPGA node for execution
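
A sketch of an interactive allocation; the partition name, node constraint, and time limit are assumptions and need to be adapted to the actual cluster configuration:

srun -p fpga --constraint=xilinx_u280_xrt2.15 -t 00:30:00 --pty bash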

To run the design, we load the proper modules and use the corresponding make command on the allocated FPGA node
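
A minimal sketch, assuming the same modules as in step 2 and the Makefile's run target with the bitstream built in step 5:

module reset
module load fpga
module load xilinx/xrt/2.15
make run TARGET=hw PLATFORM=$DEVICE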

With regard to performance, we make two observations.

  • On the one hand, the first kernel (simple) is now indeed slower than the second (pipelined) kernel. However, the difference does not correspond to the estimate from the system estimate reports. Effects of global memory access come into play that are not captured by the simple cycle model of the kernel logic.

  • Due to the non-optimized memory interface, the FPGA execution is actually slower than the emulation on the host. Note that in the example application the hardware design is also executed for many more iterations than the emulation, which makes the perceived difference appear even larger than the actual one.

You can proceed with examples cpp_kernels/wide_mem_rw, performance/kernel_global_bandwidth and performance/hbm_bandwidth to see more optimized memory performance.