Xilinx Vitis Quick Start Guide

Run your first design on a Xilinx Alveo U280 card in 6 simple steps.

1. Get the latest examples from the Xilinx repository.

The latest Vitis Accel Examples are available on GitHub. The master branch of the repository is always under development for the next release and might be incompatible with the latest version installed on our systems. Therefore, we directly check out a version-specific branch that matches the Vitis version you are going to use in this guide.

git clone --branch 2023.1 https://github.com/Xilinx/Vitis_Accel_Examples.git
cd Vitis_Accel_Examples/cpp_kernels/loop_pipeline

In their example repository for acceleration with the Vitis tools, Xilinx grouped the examples by strategies for kernel development

  • C++ based (subdirectory cpp_kernels, recommended start for new users)

  • OpenCL kernel language based (subdirectory ocl_kernels)

  • RTL based (subdirectory rtl_kernels, recommended for experts only)

and by features for the host interface

  • based on the Xilinx xrt API (subdirectory host_xrt)

  • based on the OpenCL host API (subdirectory host)

  • based on Python (subdirectory host_py)

In this quick start, we pick an example with C++ based kernels that uses the OpenCL host API, but feel free to explore the other options.

2. Set up the local software environment on Noctua 2.

module reset
module load fpga
module load xilinx/xrt/2.15

With module reset, previously loaded modules are cleaned up. The first module loaded, fpga, is a gateway module to the actual module loaded in the third line. If no version number is provided, the latest xilinx/xrt module (currently 2.15) will be loaded. Under the hood, it loads further modules with matching versions of Vitis and the U280 shell. Together, these modules set up paths and environment variables, some of which are used in the example's Makefile to specify the U280 as the target card. Observe, for example:

echo $DEVICE
xilinx_u280_gen3x16_xdma_1_202211_1
echo $PLATFORM_REPO_PATHS
/opt/software/FPGA/Xilinx/platforms/xilinx_u280_gen3x16_xdma_1_202211_1

If you have a project that was only validated with an older version of xrt or Vitis (the example repository also contains branches and tags for older tool versions), you can explicitly load the module for an older xrt version, e.g. xilinx/xrt/2.8 (the oldest supported version).
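
For example, switching to the oldest supported xrt version would look like this (a sketch, assuming the older module is still provided through the fpga gateway module):

module reset
module load fpga
module load xilinx/xrt/2.8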

Supported xrt versions and the corresponding Vitis and shell versions:

3. Build and test the example in emulation.
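
The example is built and run in software emulation via the Makefile. A minimal sketch, assuming the Makefile's standard run target; depending on the branch, the platform is passed as PLATFORM= or DEVICE=, so check the Makefile if the variable name differs:

make run TARGET=sw_emu PLATFORM=$DEVICE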

Under the hood, the makefile performs four main steps here

  • Creating the host binary

    • g++ -o loop_pipeline ...

  • Creating the FPGA emulation binary with four substeps

    • Building an object file for a non-pipelined loop

      • v++ -t sw_emu ... 'src/vector_addition_BAD.cpp'

    • Building an object file for a pipelined loop

      • v++ -t sw_emu ... 'src/vector_addition.cpp'

    • Linking the two object files

      • v++ -t sw_emu ... -l ... _x.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo _x.sw_emu.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo

    • Packaging the result to emulate execution of U280 on the CPU

      • v++ -p ...

  • Preparing an emulation environment

    • emconfigutil --platform xilinx_u280_gen3x16_xdma_1_202211_1 ...

  • Running the emulation

  • Expected output

This command will build and run the loop_pipeline example in two variants

  • vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'

  • vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'

Note: the expected output of the example contains performance figures measured in the host code. As stated in the output itself, during emulation these measurements are no indication of the performance to be expected from hardware execution. In this example, the pipelined code, which is much faster on the FPGA, is slower in emulation.

4. Create and inspect reports as an indicator of expected HW performance.

The Makefile in the example doesn’t contain an explicit target for reports. Reports are generated during the high-level synthesis step, when the C++ code is translated to a hardware description. Analyzing these reports regularly, prior to actual hardware builds, is crucially important for an efficient development process.

We demonstrate this step by manually picking the intermediate targets from the Makefile to generate the reports for the non-pipelined and pipelined variants of the loop_pipeline example:

  • vadd: simple (non pipelined loop) with source in 'src/vector_addition_BAD.cpp'
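
A possible invocation (a sketch, assuming the intermediate .xo target naming from the linking step shown in step 3 and the PLATFORM= variable; adapt if your branch uses DEVICE=). Compiling the kernel for the hw target runs high-level synthesis and writes the reports into the corresponding _x.hw.* build directory:

make ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd.xo TARGET=hw PLATFORM=$DEVICE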

The reports reveal that the first kernel (generated from src/vector_addition_BAD.cpp), with an initiation interval (II) of 2, is not optimally pipelined: a new loop iteration can only start every second clock cycle.

Now the report for the second, pipelined kernel is generated:

  • vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'
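
Analogously, assuming the corresponding intermediate target for the pipelined kernel:

make ./_x.hw.xilinx_u280_gen3x16_xdma_1_202211_1/vadd_pipelined.xo TARGET=hw PLATFORM=$DEVICE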

The second kernel contains three perfectly pipelined loops (II = 1). However, will it actually be faster when the three loops are executed one after another?

5. Build the hardware design (bitstream).

This hardware build step (the so-called hardware synthesis) can take a lot of time and compute resources, so we create a batch script to submit the job to the Slurm workload manager.
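
A minimal sketch of such a batch script, here called synthesis_script.sh; the time limit and resource requests are assumptions and should be adapted to your project, and the build target and PLATFORM= variable follow the same Makefile conventions as above:

#!/bin/bash
#SBATCH -t 04:00:00          # hardware synthesis typically runs for hours
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
module reset
module load fpga
module load xilinx/xrt/2.15
make build TARGET=hw PLATFORM=$DEVICE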

Then, we submit the synthesis_script.sh to the Slurm workload manager:
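
Assuming the script above is stored in the current working directory:

sbatch synthesis_script.sh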

To speed up the process and avoid spending resources on unnecessary synthesis runs, we have pre-synthesized the design. Expand the box below to copy the pre-synthesized design for hardware execution.

6. Execute the hardware design on an FPGA.

After the hardware synthesis, we can allocate a suitably configured FPGA node for execution
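
A sketch of an interactive allocation; the partition name, node constraint, and time limit are assumptions and need to be adapted to the actual cluster configuration:

srun -p fpga --constraint=xilinx_u280_xrt2.15 -t 00:30:00 --pty bash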

To run the design, we load the proper modules and use the corresponding make command on the allocated FPGA node
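
A minimal sketch, assuming the same modules as in step 2 and the Makefile's run target with the bitstream built in step 5:

module reset
module load fpga
module load xilinx/xrt/2.15
make run TARGET=hw PLATFORM=$DEVICE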

With regard to performance, we make two observations.

  • On the one hand, the first kernel (simple) is now indeed slower than the second (pipelined) kernel. However, the difference does not correspond to the estimate from the system estimate reports. Effects of global memory access come into play that are not captured by the simple cycle model of the kernel logic.

  • Due to the non-optimized memory interface, the FPGA execution is actually slower than the emulation on the host. Note that in the example application the hardware design is also executed for many more iterations than the emulation, which makes the perceived difference appear even larger than the actual one.

You can proceed with examples cpp_kernels/wide_mem_rw, performance/kernel_global_bandwidth and performance/hbm_bandwidth to see more optimized memory performance.