Intel has announced what they call their “First Intel AI-Optimized FPGA,” the Stratix 10 NX family. The company says these FPGAs “will offer customers customizable, reconfigurable and scalable AI acceleration for compute-demanding applications such as natural language processing and fraud detection.” Intel has bet on all the horses in the AI race, adding “Deep Learning Boost” (DL Boost) to their flagship Xeon processors to dramatically accelerate AI inference in conventional data center processors, while also investing heavily in acceleration strategies such as FPGAs and the acquisitions of Habana Labs, Nervana (whose technology has reportedly been dropped in favor of Habana Labs’), and Movidius (focused on low-end AI inference).
Intel’s strategy is clearly morphing toward heterogeneous computing in the data center, where variations in workloads can have a dramatic impact on the optimal hardware configuration. For many (if not most) data center applications, the performance of Xeon with DL Boost will be adequate for the AI tasks that come along. However, if there is a heavy load of AI demand combined with low latency requirements, it makes more sense to add workload-specific acceleration. This is where Stratix 10 NX is likely to come into play.
This new Stratix 10 NX family is a continuation of the Stratix 10 line introduced back in 2015 when Intel had just announced “plans” to buy Altera. Stratix 10 is fabricated on Intel’s 14nm 3D tri-gate (FinFET) technology, and it makes extensive use of Intel’s proprietary embedded multi-die interconnect bridge (EMIB) technology to allow the company to deploy a wide variety of devices and device families for specialized application domains by mixing and matching chiplets in a single package. EMIB is Intel’s alternative to a silicon interposer for in-package high density interconnect of heterogeneous chips.
During the wait for deployment of the next-generation Agilex family (which is based on Intel’s delayed 10nm process and began shipping to early-access customers last August), Intel has rolled out several new variants of Stratix 10 by taking advantage of EMIB’s ability to combine chiplets into domain-focused solutions. Stratix 10 now consists of six variants: GX, the general-purpose FPGAs; SX, Intel’s SoC FPGA that includes a hard processor subsystem with 64-bit quad-core ARM Cortex-A53s; TX, the transceiver-heavy variant with tons of 57.8 Gbps PAM4 transceivers; MX, which includes in-package HBM2; DX, which supports Intel Ultra Path Interconnect (Intel UPI) for direct coherent connection to select future Intel Xeon processors; and now NX, which emphasizes low-precision computation.
In the case of the new Stratix 10 NX, the company is going after the AI inference market primarily via new AI-optimized arithmetic blocks called AI Tensor Blocks. These blocks would have previously been called “DSP” blocks, but the new versions contain dense arrays of lower-precision multipliers typically used for AI model arithmetic. Intel’s previous devices focused on higher-precision multiplication and even floating point, which would be useful in targeting AI training, but inference acceleration is all about low-precision, and Intel has answered with 15x more INT8 performance than the standard Stratix 10 DSP Block.
Intel gets this INT8 boost by swapping the usual 2-multiplier, 2-accumulator architecture of the standard Stratix 10 DSP block for a 30-multiplier, 30-accumulator block that can handle INT4, INT8, Block FP12, and Block FP16. Presumably, the “15x” factor comes from going from 2 MACs to 30 MACs in the same block for lower-precision operations. Stratix 10 NX also boasts up to 16GB of in-package stacked HBM (like the MX line) and PCIe Gen3x16 plus PCIe Gen4x16 support.
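Intel hasn’t published the internals of these Block FP formats in this announcement, but block floating point generally means that a group of values shares a single exponent while each value keeps a short signed-integer mantissa, so nearly all of the arithmetic collapses into integer multiply-accumulates, which is exactly what a dense array of low-precision multipliers is good at. The NumPy sketch below illustrates that general idea only; the function names, the 8-bit mantissa width, and the 30-element block size are our assumptions for illustration, not Stratix 10 NX internals.

```python
import numpy as np

def block_fp_quantize(values, mantissa_bits=8):
    """Quantize a block of floats to a block floating point form:
    one shared exponent for the whole block plus a signed integer
    mantissa of `mantissa_bits` bits per value (illustrative only)."""
    max_abs = float(np.max(np.abs(values)))
    if max_abs == 0.0:
        return np.zeros(len(values), dtype=np.int32), 0
    # Choose the shared exponent so the largest value fills the mantissa range.
    shared_exp = int(np.ceil(np.log2(max_abs))) - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(values / 2.0 ** shared_exp), -limit - 1, limit)
    return mantissas.astype(np.int32), shared_exp

def block_fp_dot(a, b, mantissa_bits=8):
    """Dot product of two blocks: pure integer multiply-accumulate on the
    mantissas, then one final scale by the two shared exponents. This is
    why block FP maps well onto dense low-precision multiplier arrays."""
    ma, ea = block_fp_quantize(a, mantissa_bits)
    mb, eb = block_fp_quantize(b, mantissa_bits)
    return int(np.dot(ma, mb)) * 2.0 ** (ea + eb)

# 30-element blocks, echoing the 30 multipliers per AI Tensor Block.
rng = np.random.default_rng(0)
a = rng.standard_normal(30).astype(np.float32)
b = rng.standard_normal(30).astype(np.float32)
print("float32 dot :", float(np.dot(a, b)))
print("block FP dot:", block_fp_dot(a, b))
```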
Intel says Stratix 10 NX is up to 2.3X faster than Nvidia V100 GPUs for BERT batch processing, 9.5X faster in LSTM batch processing, and 3.8X faster in ResNet50 batch processing. The targeting of Nvidia in their marketing materials clearly illuminates Intel’s strategy for Stratix 10 NX. The company wants to stop Nvidia’s incursion into the data center at any cost.
Of course, as we have discussed many times, the big challenge with taking advantage of FPGA performance and power efficiency in compute acceleration is the programming model. Creating an optimized accelerator using FPGAs traditionally requires a team with significant FPGA expertise and experience, and a lot of time. Intel and rival FPGA supplier Xilinx have worked hard over the past decade or so to improve that situation, and Intel seems to be hanging their hat on their ambitious “oneAPI,” a standards-based, unified programming model that aims to facilitate integration of heterogeneous Xeon-based platforms with various accelerators such as FPGAs. Intel’s approach makes sense, given the breadth of their offering, and there is insufficient industry experience so far to fairly assess how oneAPI compares or competes with Xilinx’s Vitis, or with Nvidia’s CUDA environment. While the other solutions aim specifically at acceleration, Intel appears to be attacking the problem one level of abstraction higher, which could be a spectacular success, or could be a bridge too far.
Stratix 10 NX was announced as part of a larger Intel data center announcement, which included the debut of 3rd Gen Xeon processors with built-in AI acceleration through the integration of bfloat16 support. Bfloat16 uses half the bits of FP32 while keeping FP32’s 8-bit exponent and dynamic range, delivering comparable model accuracy for many AI workloads. Also announced were a new Intel Optane persistent memory series and new 3D NAND SSDs. Taken together, the announcements show a steady drumbeat of progress across the spectrum of data center AI workload optimization.
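For the curious, bfloat16 is essentially a float32 with the low 16 mantissa bits dropped: the sign bit and 8-bit exponent survive intact, leaving 7 explicit mantissa bits. Here is a minimal NumPy sketch of that conversion; the helper name is ours, and real hardware typically rounds to nearest rather than truncating.

```python
import numpy as np

def to_bfloat16(x):
    """Approximate float32 -> bfloat16 by truncation: keep the sign bit,
    the full 8-bit exponent, and the top 7 mantissa bits; zero the rest.
    (Hardware implementations usually round to nearest instead.)"""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.1415927, 0.000123, 65504.0], dtype=np.float32)
print(x)               # full float32 values
print(to_bfloat16(x))  # same range, about 2 to 3 significant decimal digits
```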
Intel is defending their data center dominance on multiple fronts these days, with strong pressure coming from acceleration providers such as Nvidia and Xilinx, alternative processors and architectures such as AMD and ARM, and a proliferation of standards-based approaches that minimize the company’s ability to hold off competition with a breadth-first and integration strategy. It will be interesting to watch the next few years unfold as the increasingly lucrative data center market attracts even more, and better-funded, attackers.
Block FP12 and Block FP16 … I haven’t seen Intel support block FP before, other than in their early Nervana designs. Is this something novel for FPGA FP AI processing?