When to consider FPGAs

For some workloads, FPGAs can provide great acceleration over CPU or GPU references, for other workloads, FPGAs are on par in terms of performance and only superior in energy efficiency, and for other workloads FPGAs are not a fitting architecture. As first rule of thumb for current generation devices, if your code runs close to peak performance on a CPU or GPU, it’s probably not worthwhile to look into FPGA acceleration. In the following, we give some indications when it is promising to consider FPGAs further.

1 Low-latency direct communication and overlap of execution and communication
2 Bit-level operations and data
3 Problem sizes that are insufficient to fully exploit the parallel potential of GPUs or CPUs
4 Specific access patterns to on-chip memory and co-design with parallelism
5 Irregular access to off-chip memory
6 Specific mix of operations with algebraic or transcendental functions

Low-latency direct communication and overlap of execution and communication

The FPGA-to-FPGA networking features of the PC2 infrastructure is an essential building block when parallel scaling to multiple FPGAs needs to be combined with other features that make FPGA acceleration promising. On its own it's not a sufficient criterion for speedup potential, but it can provide additional advantages like ideal overlapping of streaming computation with streaming communication, or lower latency communication.

Bit-level operations and data

FPGAs are always a promising architecture for applications where performance critical parts

use lots of bit-level operations (e.g. AND, OR, XOR, shifts)
or data can be encoded with few bits (e.g. DNA base pairs or amino acids, or binary/heavily quantized weights and inputs in neural networks).

A promising domain for FPGA acceleration is bioinformatics, where often data structures with efficient bit-level representations are used.
- The prototypical algorithm with FPGA support in the domain of bioinformatics is genome sequence alignment with the Smith-Waterman algorithm, where both the DNA base pairs and the scoring weights can be efficiently encoded in efficient custom bitwidth formats.
  - One of the foundational publications about Smith-Waterman on FPGA from 2003 presents an approach with up to 814 GCUPS (Giga Cell Updates per Second) theoretical peak performance of the processing core on FPGA: A Smith-Waterman Systolic Cell
  - In a later adaptation on FPGA systems at PC2, up to 1663 GCUPs or up to 91.5 M complete alignment pairs were reported: https://ris.uni-paderborn.de/record/35131
- In a more exploratory work, encoding the graph structure of a phylogenetic tree with a bit-level represenation and realizing update and scoring operations on the tree with logical operators and bit counting (popcount) allowed to achieve speedups over the CPU reference and power efficiency gains even over an idealized CPU performance proxy that focuses only on the scoring operations: https://ris.uni-paderborn.de/record/46190
The calculation of a 32 year old mathematical challenge could be formulated in a way that required to many trillion of small graph problems that could be efficiently implemented with bit-level operations on FPGAs: https://ris.uni-paderborn.de/record/43439
Many neural network implementations have been successfully quantized to very low bitwidth weights and activations, in some cases even down to binary representations. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference has introduced a tool flow to customize such neural networks for particularly low-latency and efficient inference on FPGAs. Within the EKI project, PC2 is advancing this technology for the usage on Alveo FPGA cards in the Noctua 2 cluster with a specific focus on scaling neural networks over multiple directly communicating FPGAs.

Problem sizes that are insufficient to fully exploit the parallel potential of GPUs or CPUs

FPGAs can be a promising architecture for applications where relevant inputs are not sufficiently large to get close to peak performance on GPUs or CPUs. Even if this is not the case on a single device or single compute node, In strong scaling scenarios, this situation can easily occur when trying to solve the same workload faster by using many devices. In this case, direct communication between FPGAs can be the second enabler to achieve overall highest performance on FPGAs.

GPUs with their tens of thousands of small cores often require millions of independent work items to reach their peak performance, CPUs with tens to hundreds of cores often require hundreds of thousands of independen work items. Through the combination of parallelism on different levels, e.g. data parallelism, pipeline parallelism, and task level parallelism, FPGAs can often achieve their peak performance already with hundreds or thousands of independent work items.

One domain where this advantage has been used are classical molecular dynamics simulations with exemplary implementations as Fully Integrated FPGA Molecular Dynamics Simulations on a single FPGA and Microsecond simulation in a special-purpose molecular dynamics computer cluster. With an in-depth comparison to highly optimized and vectorized CPU code, but limited to one specific kernel component, results in this direction have also been demonstrated on PC2 FPGA infrastructure as https://ris.uni-paderborn.de/record/28099.
Another application area where advantages with small datasets and of multi-FPGA scaling have been explored at PC2 are shallow water simulations. https://ris.uni-paderborn.de/record/54312 shows for a portable single device implementation that FPGAs are fastest for small datasets, even though the specific FPGA implementation that combines caches with off-chip global memory favors flexibility over optimized performance for any specific data size. The works in https://ris.uni-paderborn.de/record/46188 and https://ris.uni-paderborn.de/record/53364 demonstrate multi-FPGA scaling for Altera/Intel and AMD/Xilinx FPGAs at PC2 respectively that profit from the specific networking capabilities provided for both types of FPGAs in PC2 infrastructure.

Specific access patterns to on-chip memory and co-design with parallelism

FPGAs can be a promising architecture for applications where CPUs or GPUs are not operating close to their peak performance due to inefficient usage of their cache hierarchy or shared memory.

On-chip block RAM resources on FPGAs are not necessarily larger or per se faster than CPU or GPU caches, but their layout and access patterns can be customized to application demands. In particular when parallel accesses can be partitioned well, up to thousands of regular or even random accesses into small chunks of memory can be served deterministically within each clock cycle. Additionally, as counterparts to CPU or GPU register files, FPGAs also offer more flexible in-fabric memory resources in the form of individual registers as well as the option to store data in ALMs or LUT-RAM.

For Long-Range MD Electrostatics Force Computation on FPGAs, it has been shown how a multi-banked local memory buffer can be designed to deterministically allow for 64 parallel memory accesses along three spatial directions as well as for 3D blocks of size 4x4x4.
For the computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals, multi-dimensional arrays of intermediate values need to be calculated and accessed with a high degree of parallelism for best performance. In Computing and Compressing Electron Repulsion Integrals on FPGAs, we present customized FPGA designs for different angular momenta, mapping small arrays of intermediate values entirely to FPGA registers that allow full parallel access, and larger intermediate arrays into block RAMs that are partitioned to support the required parallel access patterns.
In several stencil applications and libraries, such as StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems, and in the StencilStream library developed at PC2, customized local memory layouts including block RAMs and FIFO registers are used to maximize temporal reuse of data on FPGAs to achieve acceleration or competitive performance with reduced off-chip bandwidth usage. With StencilStream, it is particularly easy to port new stencil applications to FPGAs.

Irregular access to off-chip memory

FPGAs can be a promising architecture when CPUs or GPUs can’t fully utlize their off-chip memory bandwidth due to irregular data access or control flow.

The domain of sparse linear algebra is notoriously limited by off-chip bandwidth. FPGAs can use the available bandwidth particularly efficiently, as shown for example in HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reductions on FPGAs with High Bandwidth Memory (github link), however a close comparison to the effectively achieved bandwidth on alternative GPU or CPU platforms is recommended for a first estimate of the potential for FPGA acceleration.
A similar pattern occurs in the domain of graph processing, as shown for example with ACTS: A Near-Memory FPGA Graph Processing Framework. It is unclear if the speedup potential in this domain is actually higher than for sparse linear algebra, or if the GPU comparison is not sufficiently optimized. In https://ris.uni-paderborn.de/record/53503, two programming approaches for graph processing on FPGAs have been compared on PC2 infrastructure.

Specific mix of operations with algebraic or transcendental functions

FPGAs can be a promising architecture when CPUs or GPUs are limited by the throughput of algebraic or transcendental functions like sqrt, sin, cos, exp or erf in the required accuracy.

In the Metalwalls software for molecular dynamics simulations for electrochemical systems, the throughput oferfand trigonometric functions are performance critical components. In https://ris.uni-paderborn.de/record/46189, it was shown that FPGAs with a custom precision implementation can provide performance and efficiency advantages. However, with the FPGA Software and Firmware Stacks available at PC2, it’s currently still difficult to fully exploit this potential.
In several applications,sqrtoperations for distance calculations can become critical for performance. In High Performance Computing PP-Distance Algorithms to Generate X-ray Spectra from 3D Models as well as in the work https://ris.uni-paderborn.de/record/28099 already referenced above, this has contributed to performance gains on FPGAs.