When to consider FPGAs

For some workloads, FPGAs can provide great acceleration over CPU or GPU references; for others, they are on par in performance and superior only in energy efficiency; and for yet others, FPGAs are not a fitting architecture. As a first rule of thumb for current-generation devices, if your code runs close to peak performance on a CPU or GPU, it’s probably not worthwhile to look into FPGA acceleration. The following sections give some indications of when it is promising to consider FPGAs further.

Low-latency direct communication and overlap of execution and communication

The FPGA-to-FPGA networking features of the PC2 infrastructure are an essential building block when parallel scaling to multiple FPGAs needs to be combined with other features that make FPGA acceleration promising. On their own, they are not a sufficient criterion for speedup potential, but they can provide additional advantages such as ideal overlap of streaming computation with streaming communication, or lower-latency communication.

Bit-level operations and data

FPGAs are always a promising architecture for applications where performance-critical parts

  • use lots of bit-level operations (e.g. AND, OR, XOR, shifts)

  • or data can be encoded with few bits (e.g. DNA base pairs or amino acids, or binary/heavily quantized weights and inputs in neural networks).

  • A promising domain for FPGA acceleration is bioinformatics, where data structures with efficient bit-level representations are often used.

  • The calculation of a 32-year-old mathematical challenge could be formulated in a way that required solving many trillions of small graph problems, which could be implemented efficiently with bit-level operations on FPGAs: https://ris.uni-paderborn.de/record/43439

  • Many neural network implementations have been successfully quantized to very low bit-width weights and activations, in some cases even down to binary representations. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference introduced a tool flow to customize such neural networks for particularly low-latency and efficient inference on FPGAs. Within the EKI project, PC2 is advancing this technology for use on Alveo FPGA cards in the Noctua 2 cluster, with a specific focus on scaling neural networks over multiple directly communicating FPGAs.
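To make the idea of few-bit data encodings more concrete, the following is a minimal sketch in Python, not an FPGA design. It packs DNA bases into two bits each and counts mismatches between two sequences using only XOR, OR, shift, and AND. The particular 2-bit code and the function names are illustrative assumptions and are not taken from any of the works cited above.

```python
# Hypothetical 2-bit encoding; any fixed assignment of the four bases works.
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_dna(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base (first base in the lowest bits)."""
    word = 0
    for i, base in enumerate(seq):
        word |= BASE_CODE[base] << (2 * i)
    return word


def count_mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length sequences differ, using bit operations."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    diff = pack_dna(a) ^ pack_dna(b)              # a non-zero 2-bit field marks a mismatch
    low_mask = int("01" * len(a), 2) if a else 0  # 0b0101...01, one bit per base
    collapsed = (diff | (diff >> 1)) & low_mask   # OR both bits of each field into its low bit
    return bin(collapsed).count("1")              # population count
```

On an FPGA, operations of this kind map to a handful of lookup tables per base and can be replicated across very wide words, whereas a CPU or GPU spends full-width integer instructions on them.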

Problem sizes that are insufficient to fully exploit the parallel potential of GPUs or CPUs

FPGAs can be a promising architecture for applications where relevant inputs are not large enough to get close to peak performance on GPUs or CPUs. Even if this is not the case on a single device or a single compute node, it can easily occur in strong scaling scenarios, when trying to solve the same workload faster by using many devices. In this case, direct communication between FPGAs can be the second enabler for achieving the highest overall performance on FPGAs.

GPUs, with their tens of thousands of small cores, often require millions of independent work items to reach their peak performance; CPUs, with tens to hundreds of cores, often require hundreds of thousands of independent work items. Through the combination of parallelism on different levels, e.g. data parallelism, pipeline parallelism, and task-level parallelism, FPGAs can often reach their peak performance with only hundreds or thousands of independent work items.
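The strong scaling argument can be illustrated with a small back-of-the-envelope calculation. The thresholds below are assumptions that only reflect the rough orders of magnitude named above; real values depend on the workload and the device.

```python
# Assumed orders of magnitude of independent work items needed to saturate one device.
# These are illustrative placeholders, not measured values.
ITEMS_TO_SATURATE = {"GPU": 1_000_000, "CPU": 100_000, "FPGA": 1_000}


def max_saturated_devices(total_items: int, device: str) -> int:
    """Largest device count at which each device still gets enough work items to be saturated."""
    return total_items // ITEMS_TO_SATURATE[device]
```

Under these assumptions, a fixed problem with 4,000,000 independent work items keeps at most 4 GPUs, 40 CPUs, or 4,000 FPGAs saturated; beyond that point, adding devices leaves each one underutilized.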

Specific access patterns to on-chip memory and co-design with parallelism

FPGAs can be a promising architecture for applications where CPUs or GPUs are not operating close to their peak performance due to inefficient usage of their cache hierarchy or shared memory.

On-chip block RAM resources on FPGAs are not necessarily larger or per se faster than CPU or GPU caches, but their layout and access patterns can be customized to application demands. In particular when parallel accesses can be partitioned well, up to thousands of regular or even random accesses into small chunks of memory can be served deterministically within each clock cycle. Additionally, as counterparts to CPU or GPU register files, FPGAs also offer more flexible in-fabric memory resources in the form of individual registers as well as the option to store data in ALMs or LUT-RAM.
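What it means for parallel accesses to be "partitioned well" can be sketched with a simplified software model. The model assumes that an array is distributed cyclically over a number of independent memory banks and that each bank can serve a fixed number of accesses per clock cycle. The function names and the single-port default are assumptions made for illustration only.

```python
def bank_of(index: int, num_banks: int) -> int:
    """Cyclic partitioning: element i is stored in bank i mod num_banks."""
    return index % num_banks


def is_conflict_free(indices, num_banks: int, ports_per_bank: int = 1) -> bool:
    """True if all accesses can be served in the same cycle under the assumed bank model."""
    load = {}
    for idx in indices:
        bank = bank_of(idx, num_banks)
        load[bank] = load.get(bank, 0) + 1
    return all(count <= ports_per_bank for count in load.values())
```

In this model, eight consecutive elements spread over eight banks can be read in one cycle, while eight elements with a stride of eight all fall into the same bank and conflict. On an FPGA, the partitioning scheme can be chosen to match the access pattern of the application, which is not possible with the fixed cache organization of a CPU or GPU.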

  • For the computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals, multi-dimensional arrays of intermediate values need to be calculated and accessed with a high degree of parallelism for best performance. In https://ris.uni-paderborn.de/record/54312, we present customized FPGA designs for different angular momenta, mapping small arrays of intermediate values entirely to FPGA registers that allow full parallel access, and larger intermediate arrays into block RAMs that are partitioned to support the required parallel access patterns.

  • In several stencil applications and libraries, such as StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems, and in the StencilStream library developed at PC2, customized local memory layouts including block RAMs and FIFO registers are used to maximize temporal reuse of data on FPGAs to achieve acceleration or competitive performance with reduced off-chip bandwidth usage. With StencilStream, it is particularly easy to port new stencil applications to FPGAs.
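The temporal reuse mentioned for stencil designs can be sketched with a generic one-dimensional example. This is not the API of StencilStream or StencilFlow; it is a minimal software model in which a small window, playing the role of an on-chip shift register, lets every input value be read from the stream exactly once while still contributing to several outputs.

```python
from collections import deque


def stream_stencil_3pt(stream):
    """Apply a 3-point average to a stream, reading each input value exactly once."""
    window = deque(maxlen=3)  # stands in for a small on-chip shift register
    for value in stream:
        window.append(value)  # the oldest value drops out automatically
        if len(window) == 3:
            yield (window[0] + window[1] + window[2]) / 3
```

For example, `list(stream_stencil_3pt([3, 6, 9, 12]))` yields `[6.0, 9.0]`. Each input is fetched once but used in up to three outputs, which is the kind of reuse that reduces off-chip bandwidth demand.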

Irregular access to off-chip memory

FPGAs can be a promising architecture when CPUs or GPUs can’t fully utilize their off-chip memory bandwidth due to irregular data access or control flow.

Specific mix of operations with algebraic or transcendental functions

FPGAs can be a promising architecture when CPUs or GPUs are limited by the throughput of algebraic or transcendental functions such as sqrt, sin, cos, exp, or erf at the required accuracy.