Xilinx Vitis Quick Start Guide
Run your first design on a Xilinx Alveo U280 card in 6 simple steps.
1. Get the latest examples from the Xilinx repository.
The latest Vitis Accel Examples are available on GitHub. The master branch of the repository is always under development for the next release and might be incompatible with the Vitis version installed on our systems. Therefore, we directly check out a version-specific branch that matches the Vitis version you are going to use in this guide.
git clone --branch 2023.1 https://github.com/Xilinx/Vitis_Accel_Examples.git
cd Vitis_Accel_Examples/cpp_kernels/loop_pipeline
2. Set up the local software environment on Noctua 2.
module reset
module load fpga
module load xilinx/xrt/2.15
3. Build and test the example in emulation.
The following command will build and run the loop_pipeline example in two variants:

vadd: simple (non-pipelined loop) with source in 'src/vector_addition_BAD.cpp'
vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'
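A minimal sketch of that command, assuming the standard Vitis_Accel_Examples Makefile targets and that xilinx_u280_gen3x16_xdma_1_202211_1 is the U280 platform installed on the system (you can verify the installed platforms with platforminfo --list):

make run TARGET=sw_emu PLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1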
Note: the expected output of the example contains performance figures measured in the host code. As noted in the output itself, during emulation these measurements give no indication of the performance to be expected from hardware execution. In this example, the pipelined code, which is much faster on the FPGA, is slower in emulation.
4. Create and inspect reports as an indicator of expected HW performance.
The Makefile in the example doesn't contain an explicit target for reports. Reports get generated during the high-level synthesis step, when the C++ code is translated to a hardware description. Analyzing these reports regularly prior to actual hardware builds is crucially important for an efficient development process.
We demonstrate this step by manually picking the intermediate targets from the Makefile to generate the reports for the non-pipelined and pipelined variants of the loop_pipeline example. First, the report for the simple variant is generated:

vadd: simple (non-pipelined loop) with source in 'src/vector_addition_BAD.cpp'
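The Makefile's intermediate .xo targets invoke the v++ compile step under the hood. As a sketch, assuming the platform name from above and that the kernel in src/vector_addition_BAD.cpp is named vadd (the exact names are defined in the example's Makefile), the equivalent direct invocation is:

v++ -c -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 -k vadd -o vadd.hw.xo src/vector_addition_BAD.cpp

The HLS reports are then typically found in the temporary build directory, e.g. under _x/reports/.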
The reports reveal that the first kernel (generated from src/vector_addition_BAD.cpp) is not optimally pipelined: it only achieves an initiation interval (II) of 2.
Now the report for the second, pipelined kernel is generated:
vadd: pipelined (pipelined loop) with source in 'src/vector_addition.cpp'
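Again as a sketch, assuming the pipelined kernel is named vadd_pipelined:

v++ -c -t hw --platform xilinx_u280_gen3x16_xdma_1_202211_1 -k vadd_pipelined -o vadd_pipelined.hw.xo src/vector_addition.cpp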
The second kernel contains three perfectly pipelined loops. However, will it actually be faster when the three loops are executed one after another?
5. Build the hardware design (bitstream).
This hardware build step (so-called hardware synthesis) can take considerable time and compute resources, so we create a batch script to submit the job to the Slurm workload manager.
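A sketch of such a script, here called synthesis_script.sh; the job name, runtime, and resource limits are illustrative assumptions that may need adjusting:

#!/bin/bash
#SBATCH --job-name=vitis_synthesis
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

# recreate the software environment from step 2
module reset
module load fpga
module load xilinx/xrt/2.15

# build host binary and bitstream for the actual hardware target
make build TARGET=hw PLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1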
Then, we submit the synthesis_script.sh to the Slurm workload manager:
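sbatch synthesis_script.sh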
To speed up the process and save the resources of an unnecessary synthesis run, we have pre-synthesized the design. You can copy the pre-synthesized design and use it for hardware execution instead of building it yourself.
6. Execute the hardware design on an FPGA.
After the hardware synthesis, we can allocate a suitably configured FPGA node for execution:
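For example (the partition and constraint names for U280 nodes are system-specific assumptions here; check the system documentation for the exact values):

srun --partition=fpga --constraint=xilinx_u280 --time=00:30:00 --pty bash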
To run the design, we load the proper modules and use the corresponding make command on the allocated FPGA node:
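A sketch, again assuming the platform name from above:

module reset
module load fpga
module load xilinx/xrt/2.15
make run TARGET=hw PLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1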
With regard to performance, we make two observations. On the one hand, the first (simple) kernel is now indeed slower than the second (pipelined) kernel. However, the difference does not correspond to the estimates from the system estimate reports: effects of global memory access come into play that are not captured by the simple cycle model of the kernel logic. On the other hand, due to the non-optimized memory interface, the FPGA performance is actually slower than emulation on the host. Note that in the example application the hardware design is also executed for many more iterations than the emulation, which makes the perceived difference appear even larger than the actual one.
You can proceed with the examples cpp_kernels/wide_mem_rw, performance/kernel_global_bandwidth, and performance/hbm_bandwidth to see designs with more optimized memory performance.