Are you working in scientific computing and would like to discover FPGA computing?
Are you using FPGAs as electronic components in a processing chain and would like to discover the latest tools for high-level synthesis?
This event, organised by Groupe Calcul and theme C of GdR ISIS, is about getting feedback from those who have already experienced FPGAs, and about putting your hands on these devices and the programming models proposed by the two manufacturers: Vitis HLS (AMD-Xilinx) and oneAPI (Intel). An ISIS day closes the event, with three invited talks in the morning, talks selected from a call for contributions in the afternoon, and a round table at the end of the day.

34 participants for the hands-on sessions
40 participants for the ISIS day
- Christophe Alias (INRIA / LIP)
- Stefano Corda (EPFL, Switzerland)
- Mickaël Dardaillon (INSA Rennes / IETR)
- Florent De Dinechin (INSA Lyon)
- Suleyman Demirsoy (Intel)
- Steven Derrien (IRISA, Université de Rennes 1)
- Daouda Diakite (L2S - Université Paris Saclay)
- William Duluc (MVD-Training)
- Omar Hammami (ENSTA PARIS)
- Shan Mignot (Laboratoire Lagrange)
- Maurizio Paolini (Intel)
- Charles Prouveur (CEA)
- Antsa Randriamanantena (CNRS - LAB)
- Olivier Régnault (AMD-Xilinx dedicated FAE)
- Kentaro Sano (RIKEN Center for Computational Science, Japan)
- Xin Wu (Paderborn University, Germany)
On the one hand, an FPGA is a multiprocessor on a chip with up to several million elementary processors and a cumulative internal bandwidth of several Tbit/s. On the other hand, these elementary processors operate at the bit level, and their frequency is far below that of a conventional CPU. Worse, compiling a program for such a chip may take several days.
All things considered, are FPGAs any good at scientific computing? The answer, of course, is "it depends", and this talk will attempt to refine this statement.
After a perfectly balanced presentation of FPGA architectures and programming models, it will provide a serenely partial and biased review of FPGA success stories in scientific computing.

High-throughput and low-latency edge applications need co-designed solutions to meet their performance requirements. Quantized Neural Networks (QNNs) combined with custom FPGA dataflow implementations offer a good balance of performance and flexibility, but building such implementations by hand is difficult and time-consuming. In this presentation, we will introduce FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. It is an open-source experimental framework by Xilinx Research Labs to help the broader community explore deep neural network (DNN) inference on FPGAs. It specifically targets QNNs, with an emphasis on generating dataflow-style architectures customized for each network. It is not intended to be a generic DNN accelerator like xDNN, but rather a tool for exploring the design space of DNN inference accelerators on FPGAs. The key components are Brevitas for training quantized neural networks, the FINN compiler, and the finn-hlslib Vivado HLS library of FPGA components for QNNs.

Radio telescopes cover wide frequency bands and make extensive use of interferometry. This leads to the production of large volumes of data and calls for considerable processing. Since storing the raw data is materially impossible, the SKA project has decided to incorporate the processing facilities into the telescopes. Two supercomputers, one for each telescope, are hence envisaged to ingest an expected representative flow of 0.77 TB/s and carry out preliminary data reduction tasks, both to reduce the volume of data and to yield science products. Pre-construction work has led to a concept based on a homogeneous set of nodes processing data on the fly or on demand within a few days.
This distinction stems from the need to provide operational feedback and to average the computing load, which peaks at an estimated 125 PFlops but averages out to 10 PFlops. Generic COTS systems have hitherto been considered to maximise versatility and refrain from specialising software development. However, significant risks have been identified concerning procurement and operating costs. I will present the co-design exercise which is ongoing to mitigate this. In this frame, with the advent of high-level synthesis, FPGAs, with their higher resource utilisation and lower operating frequencies, could become an option, notably for on-the-fly tasks, in-network processing or as accelerators for selected calculations, should the risk/benefit ratio prove favourable.

Many-core processors such as GPUs are currently the preferred technological target for accelerating HPC applications. However, architectures designed on FPGAs can be interesting alternatives to GPUs because they potentially draw less power and are now accessible thanks to the new high-level synthesis (HLS) tools provided by the leading manufacturers such as Intel and Xilinx. Exploiting the full potential of FPGAs via HLS tools nevertheless requires a deep knowledge of their architecture and a significant effort to match the application to the underlying architecture. In this presentation, I will present the principle of HLS tools as well as a methodology for FPGA acceleration through Intel's OpenCL and oneAPI tools. The 3D back-projection operator, present in iterative tomographic reconstruction algorithms, is considered as a use case for this methodology.
Matrix free conjugate gradient with Maxeler Data Flow Engine technology
Charles Prouveur

In this presentation, the implementation of a mini-app extracted from a production code in materials science (MetalWalls) using Maxeler technology will be explained, after which a chip-to-chip comparison between a CPU, a GPU and an FPGA, as well as a scalability study on multiple FPGAs, will be presented. The core algorithm is a matrix-free conjugate gradient that computes the total electrostatic energy through an Ewald summation at each iteration. The FPGA implementation, using a 40-bit floating-point number representation, outperforms the CPU implementation both in terms of computing power and energy usage, resulting in an energy efficiency more than 14 times better. Compared to a GPU of the same generation, the FPGA reaches 60% of the GPU performance while its performance per watt is still better by a factor of 3. Thanks to its low average power usage, the FPGA bests both the fully loaded CPU and GPU in terms of number of conjugate gradient iterations per second and per watt.
AMD-Xilinx System on Chip (SoC) FPGA: an introduction + demo
Olivier Régnault

Olivier Régnault is a senior expert in FPGA & System on Chip (SoC) architectures and works as a Field Application Engineer & Product Line Manager for the European semiconductor distributor Avnet Silica. The talk will begin with an introduction to the AMD-Xilinx SoC architecture, with a focus on the ZYNQ UltraScale+ family. A development demonstration with Vivado and Vitis will follow, and, to conclude, a presentation of the new Versal architecture for accelerated computing platforms.

The presentation will give an overview of the Vitis HLS tool. This tool can translate C/C++ code into an RTL language in order to implement the function in an FPGA architecture. In the second part of the presentation, some of the optimization techniques that will be used in the hands-on lab will be detailed.

High-level synthesis tools for FPGA such as Vitis HLS simplify the development of accelerated applications using a high-level C language and combining pre-existing kernels. However, the connection of dataflow buffers between these kernels still needs to be specified and optimized manually by the developer. In this presentation, we introduce a new method and an associated tool to generate HLS code from a dataflow graph, and to automatically compute buffer sizes that reach the highest throughput.
Work on the accelerated calculation of electron repulsion integrals on FPGAs using oneAPI
Xin Wu

The calculation of electron repulsion integrals (ERIs) is a major bottleneck in quantum chemistry applications. In this work, the accelerated calculation of ERIs is developed on Intel Stratix 10 GX 2800 FPGAs using oneAPI as the high-level synthesis (HLS) tool. To maximize performance, the arrays for intermediate results are carefully designed to take advantage of the FPGA local memory for parallel data accesses. Via template arguments, multiple kernel variants for the different angular momenta of the input electrons are generated, which allows inner loops with recursive dependencies to be fully unrolled. Our FPGA kernels for ERIs of high angular momenta outperform the libint library on a compute node with 2 CPU sockets by about 4x. A performance model is established to explain the measured FPGA performance.

The presentation will provide an overview of the oneAPI initiative and Intel products for heterogeneous computing. It will then cover the conceptual differences between coding for FPGAs and coding for CPUs/GPUs and describe the development flow specific to FPGA platforms. In the second part of the presentation, some basic oneAPI design techniques for FPGA to be used in the hands-on lab will be detailed.
ESSPER: FPGA Cluster for Research on Reconfigurable HPC with Supercomputer Fugaku
Kentaro Sano

At RIKEN Center for Computational Science (R-CCS), we have been developing an experimental FPGA cluster named "ESSPER (Elastic and Scalable System for high-PErformance Reconfigurable computing)," a research platform for reconfigurable HPC. ESSPER is composed of sixteen Intel Stratix 10 SX FPGAs connected to each other by a dedicated 100 Gbps inter-FPGA network. We have developed our own shell (SoC) and its software APIs for the FPGAs, supporting inter-FPGA communication. The FPGA host servers are connected to a 100 Gbps InfiniBand switch, which allows distant servers to access the FPGAs remotely through a software-bridged version of Intel's OPAE FPGA driver, called R-OPAE. Through this 100 Gbps InfiniBand network and R-OPAE, ESSPER is connected to the world's fastest supercomputer, Fugaku, deployed at RIKEN, so that from Fugaku we can program bitstreams onto the FPGAs remotely and off-load tasks to them. In this talk, I introduce ESSPER's concept, its hardware and software stack, its programming environment and applications under development, as well as our future prospects for reconfigurable HPC.
Reduced-Precision Acceleration of Radio-Astronomical Imaging on Xilinx FPGAs
Stefano Corda

Modern radio telescopes such as the Square Kilometre Array (SKA) produce large volumes of data that need to be processed to obtain high-resolution sky images. This is a complex task that requires computing systems that provide both high performance and high energy efficiency. Hardware accelerators such as GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays) can provide these two features and are thus an appealing option for this application. Most HPC (High-Performance Computing) systems operate in double precision (64-bit) or in single precision (32-bit), and radio-astronomical imaging is no exception. With reduced precision computing, smaller data types (e.g., 16-bit) aim at improving energy efficiency and throughput performance in noise-tolerant applications. We demonstrate that reduced precision can also be used to produce high-quality sky images. To this end, we analyze the gridding component (Image-Domain Gridding) of the widely-used WSClean imaging application. Gridding is typically one of the most time-consuming steps in the imaging process and, therefore, an excellent candidate for acceleration. We identify the minimum required exponent and mantissa bits for a custom floating-point data type. Then, we propose the first custom floating-point accelerator on a Xilinx Alveo U50 FPGA using High-Level Synthesis. Our reduced-precision implementation improves the throughput and energy efficiency by respectively 1.84x and 2.03x compared to the single-precision floating-point baseline on the same FPGA. Our solution is also 2.12x faster and 3.46x more energy-efficient than an Intel i9 9900k CPU (Central Processing Unit) and manages to keep up in throughput with an AMD RX 550 GPU.
Heterogeneous Embedded Multicore Design Graduate Education in ENSTA PARIS: A 5 years Feedback
Omar Hammami

In this talk we will present five years of feedback on training graduate-level students at ENSTA PARIS, the oldest engineering school in France, in heterogeneous embedded multicore design on the Xilinx Zynq SoC chip. As part of the ROB 307 MPSOC (Multiprocessor System on Chip) course, students are required to design a heterogeneous embedded multicore combining a dual-core hard IP (ARM9), 4 MicroBlaze soft cores, 2 hardware accelerators (neural network, vision, image processing) and an AXI NoC (Network on Chip) on a single Zynq XC7Z020 chip using a ZedBoard. Students are expected to validate their design through actual execution on the ZedBoard, with all IPs running concurrently. This project has been running for the past 5 years and we will share our experience from this training.

Hardware accelerators are unavoidable for improving the performance of computers within a bounded energy budget. In particular, FPGAs allow building dedicated circuits from a gate-level description, enabling a very advanced level of optimization. High-level synthesis (HLS) tools let the programmer target FPGAs without the constraints linked to hardware, compiling a C specification into a circuit. Code optimizations in these tools remain rudimentary (loop unrolling, pipelining, etc.), and are most often the responsibility of the programmer. The polyhedral model, born from research on systolic circuits, offers a powerful tool to optimize compute kernels for HPC. In this seminar, I will show a few interconnections between HLS and the polyhedral model, either as a preprocessing (source-to-source) step, or as a synthesis tool (optimizing the circuit using a dataflow intermediate representation). In particular, I will present a dataflow formalism that allows reasoning geometrically about circuit synthesis.

Loop pipelining (LP) is a key optimization in modern high-level synthesis (HLS) tools for synthesizing efficient hardware datapaths.
Existing techniques for automatic LP are limited by static analysis, which cannot precisely analyze loops with data-dependent control flow and/or memory accesses. We propose a technique for speculative LP that handles both control-flow and memory speculation in a unified manner. Our approach is entirely expressed at the source level, allowing seamless integration into development flows using HLS. Our evaluation shows a significant improvement in throughput over standard loop pipelining techniques.
Using Unified Shared Memory and External Function Interface with oneAPI
Suleyman Demirsoy

The Unified Shared Memory (USM) abstraction offers significant ease of use and, in some cases, performance benefits when critical functions are offloaded to an accelerator such as an FPGA. Some of these critical functions would also benefit from lower-level customization that is possible at the RTL level but not easy to capture in oneAPI code. In this talk, we will look more closely at both topics, as a follow-up to the main oneAPI introduction presented earlier in the conference.
- Mickaël Dardaillon (INSA Rennes / IETR)
- Nicolas Gac (L2S - Université Paris Saclay)
- Matthieu Haefele (CNRS/UPPA)
- Shan Mignot (Laboratoire Lagrange)
- Charles Prouveur (CEA)
- Antsa Randriamanantena (CNRS/LAB)
- Bogdan Vulpescu (CNRS/IN2P3/UCA)