It’s no secret that we are drowning in data. Today’s applications and algorithms require almost incomprehensible amounts of data, and that means the bandwidth requirements are exploding faster than networking and memory technologies can handle. Even with the most advanced accelerators we can build with our FPGAs, we can be choked trying to get data on and off the chip and finding places to store information as we are processing.
Even though memory bandwidth has been increasing rapidly, the demand is growing faster. Pushing around zettabytes of information worldwide has stressed current technologies to the breaking point. Pushing performance-critical tasks off to FPGAs doesn’t help if the system is starved for memory bandwidth.
At the same time, more and more of that data needs to be secured, and every time data is moved across an interface, it becomes vulnerable.
What we need is to move the memory closer to the processing.
Xilinx has taken a big step toward memory localization with their new Versal HBM series of “ACAP” devices (we think of them as FPGAs). HBM (or high-bandwidth memory) is designed to sit in the same package with other processing elements, communicating via stacked-silicon interconnect (SSI) advanced packaging technology. By keeping the memory in-package, much higher-bandwidth connections are possible, and avoiding off-chip memory interfaces significantly reduces power consumption and interface latency.
This is far from Xilinx’s first rodeo with SSI. The company pioneered silicon interposers for FPGAs years ago, and this new device is built on fourth-generation SSI. Early on, SSI was used primarily to increase effective yield by packing several smaller FPGA chiplets into a single package to build a larger FPGA. But today, SSI is also used to make Xilinx’s silicon more scalable and versatile. To build Versal HBM, for example, they just swapped out one “Super Logic Region” (SLR) chiplet in their Versal Premium device for an HBM2e stack. (OK, it’s a little bit more complicated than that, but you get the idea.)
Compared with external DDR5, in-package HBM offers 8x the bandwidth at 63% lower power. And that’s a big deal. Parking an HBM stack inside your FPGA gives you a memory bandwidth bonanza, while saving your power budget for processing.
This is not the first time Xilinx has popped HBM into one of their devices. One version of their previous-generation Virtex UltraScale+ FPGAs featured in-package HBM. The new Versal HBM outperforms that one on every axis, however, with 1.8x the memory bandwidth (from 460GB/s to 820GB/s) at 15% lower power and 2x the HBM capacity (32GB vs 16GB).
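For anyone who likes to check the marketing math, here is a quick sketch of the generation-over-generation ratios. It uses only the figures quoted above; the variable names are ours, and the ratios are unitless, so they hold whatever unit the two bandwidth figures share.

```python
# Figures quoted in the article for the two HBM-equipped device families.
old_bandwidth, new_bandwidth = 460, 820   # Virtex UltraScale+ HBM vs Versal HBM
old_capacity_gb, new_capacity_gb = 16, 32  # maximum in-package HBM capacity

bandwidth_ratio = new_bandwidth / old_bandwidth
capacity_ratio = new_capacity_gb / old_capacity_gb

# The bandwidth ratio is a little under 1.8, which rounds to the quoted "1.8x".
print(f"bandwidth: {bandwidth_ratio:.2f}x, capacity: {capacity_ratio:.0f}x")
```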
Versal HBM has a lot more than just more memory bandwidth, though. They’ve also significantly increased the size of the SerDes pipes for getting data on and off the device, doubling the total bandwidth to a mind-bending 5.6Tb/s. The SerDes is scalable for maximum application flexibility, with 32Gbps NRZ for power-optimized 100G interfaces, 58Gbps PAM4 for the current 400G ramp and deployment, and super-sporty 112Gbps PAM4 for future 800 gig network development on 100G per lane optics.
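To make the three SerDes classes a bit more concrete, the sketch below works out how many lanes a port of each speed would need. The nominal per-lane payloads (25G, 50G, and 100G) are our assumption based on common Ethernet lane rates, not figures from Xilinx; the raw line rates are higher because of encoding and FEC overhead.

```python
import math

# Assumed nominal Ethernet payload per lane (Gb/s) for each SerDes class.
# These are illustrative values, not Xilinx specifications.
PAYLOAD_PER_LANE_GBPS = {
    "32G NRZ": 25,
    "58G PAM4": 50,
    "112G PAM4": 100,
}

def lanes_needed(port_gbps, serdes_class):
    """Return the number of lanes needed to carry a port of the given speed."""
    return math.ceil(port_gbps / PAYLOAD_PER_LANE_GBPS[serdes_class])

for port, serdes in [(100, "32G NRZ"), (400, "58G PAM4"), (800, "112G PAM4")]:
    print(f"{port}G over {serdes}: {lanes_needed(port, serdes)} lanes")
```

Under these assumptions, moving to 100G-per-lane signaling is what lets an 800G port fit in the same eight lanes that 400G needs today.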
Many standard interfaces are pre-built and hardened for you, including 2.4Tb/s of scalable Ethernet bandwidth that is multi-rate (400/200/100/50/40/25/10G with FEC) and multi-standard (FlexE, Flex-O, eCPRI, FCoE, and OTN). Security can be done quickly with 1.2Tb/s of line-rate encryption throughput delivered by bulk crypto engines supporting AES-GCM-256/128, MACsec, and IPsec. Xilinx claims this is the “World’s only hardened 400G Crypto Engine on an adaptable platform.”
If PCIe is your jam, Versal HBM packs 1.5Tb/s of aggregated PCIe link bandwidth via PCIe Gen5 with DMA, CCIX, and CXL (yep, playing for either team now). The PCIe interface has dedicated connectivity over the programmable network-on-chip (NoC) to memory.
So, Versal HBM can obviously do a super job getting data onto and off the chip and parking it in memory while it’s there. But, what about the ability to do actual work?
The new device has a triple-header of capabilities to execute and accelerate a wide variety of workloads. Xilinx now refers to these as “engines,” and Versal HBM (like their other ACAP devices) includes “Scalar,” “Adaptable,” and “DSP” engines. In more conventional terms, the “Scalar” engines are Arm-based processing systems consisting of dual-core Arm Cortex-A72 application processors and dual-core Arm Cortex-R5F real-time processors. The “Adaptable” engines are primarily what we’d think of as FPGA LUT fabric (3.8M or 5.6M logic cells’ worth), and the “DSP” engines consist of 7.4K or 10.9K DSP slices. Taken together, that’s an impressive amount of compute resources to tackle the tough problems in networking, data center, test and measurement, and aerospace and defense, the target markets for Versal HBM.
Xilinx provided a couple of benchmarks. The first is in the healthcare arena: a real-time recommendation engine that uses a cosine similarity algorithm for clinical outcome predictions. Here, they claim Versal HBM can handle 2x the patient database size of the previous-generation Virtex UltraScale+ and 4x the size of a 3rd gen Intel x86 Xeon Gold/Platinum Scalable processor. Speed-wise, they claim 100x the speed of the Virtex and 200x that of the x86.
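If cosine similarity is new to you, here is a minimal sketch of the idea in plain Python. It is not Xilinx’s benchmark code; the function names, record names, and feature vectors are all hypothetical, and a real engine would compare a query against millions of records held in memory, which is exactly where HBM bandwidth and capacity come in.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query, records):
    """Return the key of the record whose vector best matches the query."""
    return max(records, key=lambda name: cosine_similarity(query, records[name]))

# Hypothetical patient feature vectors, for illustration only.
records = {
    "patient_a": [1.0, 0.0, 1.0],
    "patient_b": [0.0, 1.0, 0.0],
}
best_match = most_similar([0.9, 0.1, 0.8], records)
print(best_match)
```

Every query touches every stored vector, so the workload is dominated by memory reads rather than arithmetic.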
The second benchmark is in real-time fraud detection – Louvain modularity algorithm – to detect anomalies in behavior/transactions. (You know, when the credit card company calls and asks if you just bought a Ferrari on Easter Island.) In this example, they claim the same 2x and 4x capacity advantage (number of vertices), and a more modest 10x and 20x speed advantage over Virtex and x86 respectively.
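The Louvain method works by repeatedly regrouping vertices to push up a score called modularity. The sketch below computes only that score for a given grouping of an undirected, unweighted graph; it is not the full Louvain algorithm and not Xilinx’s implementation, and the function name and toy graph are ours.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each edge listed once.
    community_of: dict mapping each vertex to a community label.
    Q = sum over communities of (intra_edges / m) - (degree_sum / (2 * m)) ** 2
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra_edges = defaultdict(int)  # edges with both ends in the community
    degree_sum = defaultdict(int)   # total degree of the community's vertices
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra_edges[cu] += 1
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}

q_split = modularity(edges, two_groups)
q_merged = modularity(edges, one_group)
print(q_split > q_merged)
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the kind of structure a fraud detector leans on when it looks for accounts that don’t fit their community.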
If piles of chips are more your benchmark style, Xilinx says Versal HBM packs the equivalent of 14 Virtex UltraScale devices, with the equivalent of 32 DDR5 chips-worth of HBM.
Versal HBM will come in 2 basic sizes, but with 3 different helpings of HBM: 8, 16, or 32GB. You can get started on your design now with the Versal Premium series (which is basically the same as the Versal HBM, but without the HBM). Documentation is available now, tools are due in the second half of 2021, and devices begin sampling in the second half of 2022.