Unless you’ve been hiding under a rock, you know that FPGA-based compute acceleration is suddenly a hot topic. And, even from under the rock, you probably got the memo that Intel paid over $16B to acquire Altera a couple years back – mostly to capitalize on this “new” emerging killer app for FPGA technology. These days there is an enormous battle brewing for control of the data center, as Moore’s Law slows to a crawl and engineers look for alternative ways to crunch more data faster with less power.
As we’ve discussed at length in these pages, FPGAs are outstanding platforms for accelerating many types of compute workloads, particularly those where datapaths lend themselves to massively parallel arithmetic operations. FPGAs can crush conventional processors by implementing important chunks of computationally intense algorithms in hardware, with dramatic reduction in latency and (often more important) power consumption.
The big downside to FPGA-based acceleration is the programming model. In order to get optimal performance from a heterogeneous computing system with FPGAs and conventional processors working together, you need a way to partition the problem, turn conventional code into appropriate FPGA architectures, and realize that whole thing in a well-conceived hardware configuration. This requires, among other things, a good deal of expertise in FPGA design, as well as an overall strategy that accounts for getting data into and out of those FPGA accelerators, and a memory and storage architecture that’s up to the task. Getting it right is no small feat, and there are countless ways to go wrong along the way and end up with very little gain from your FPGA investment.
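To give a flavor of what “turning conventional code into an FPGA architecture” actually involves, here is a minimal sketch in the style of a C-to-hardware (HLS) flow. The pragmas follow Vivado HLS conventions, but the function name and block size are invented for illustration; a real design would also need host code, data movement, and a memory architecture wrapped around it.

    // Hypothetical HLS kernel: multiply-accumulate over one block of data.
    // The pragmas are hints to an HLS compiler (Vivado HLS syntax shown);
    // a plain C++ compiler ignores them, so this also builds with g++.
    #include <cstddef>

    constexpr std::size_t BLOCK = 1024;  // illustrative block size

    void mac_kernel(const float a[BLOCK], const float b[BLOCK], float c[BLOCK]) {
    #pragma HLS INTERFACE m_axi port=a   // move the arrays over an AXI bus
    #pragma HLS INTERFACE m_axi port=b
    #pragma HLS INTERFACE m_axi port=c
        for (std::size_t i = 0; i < BLOCK; ++i) {
    #pragma HLS PIPELINE II=1            // ask for one result per clock
            c[i] += a[i] * b[i];
        }
    }

Note that the loop itself is the easy part. The pragmas are merely requests; whether the tool can actually deliver one result per clock depends on exactly the partitioning, bandwidth, and memory decisions described above.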
At this week’s Supercomputing conference in Dallas, Bittware (acquired by Molex earlier this year) announced they were “joining forces” with Nallatech (acquired by Molex last year as part of the Molex acquisition of Interconnect Systems, Inc.). In the FPGA acceleration world, this is a big deal. While FPGA-based acceleration may be a new and hot topic for most of us, and it may seem like a brand-new world is opening up for engineering exploration, these folks have been at it for a LONG time.
How long? Allan Cantle founded Nallatech to focus on FPGA-based acceleration in 1993.
Let that date sink in for a minute or two. Yep, Nallatech has been doing FPGA acceleration for over a quarter of a century. Dial that into your decoder ring and figure out what kind of FPGAs were involved way back then. We first began covering Nallatech in 2004, and by that time the company already had more than a decade of experience in dealing with the challenges of what was (at the time) known as “reconfigurable computing.” Bittware has a similarly storied history in acceleration, beginning with DSP-based boards for the ISA bus back in 1991 and moving into FPGAs in partnership with Altera in 2004. Together, these two companies likely have more experience and more design successes with FPGA-based acceleration than anyone else on the planet.
It’s interesting that these two long-time veterans in this technology would be united under the Molex banner. With the likes of Intel, NVIDIA, Xilinx, AMD, Arm, and others throwing crazy resources into the battle – acquiring talent and technology wherever they can to bolster their campaigns – there’s a bit of irony in a connector company cornering two of the most experienced names in the business. But that seems to be exactly what has happened.
According to the announcement, the combined Bittware/Nallatech teams (operating under the Bittware flag) will offer both Intel- and Xilinx-based FPGA acceleration solutions. By betting on both horses, Bittware takes the platform issue off the table. Bittware says they are targeting applications such as machine learning inference, real-time data analytics, high-frequency trading, real-time network monitoring, and video broadcast (among others). None of these come as any surprise, and it appears that Bittware is taking advantage of the combined resources of Bittware, Nallatech, and Molex to land the type of large enterprise customers who would have been leery of the more niche nature of Bittware or Nallatech alone.
Bittware breaks their solutions down into “Compute,” “Network,” and “Storage” – showing their experience right off the bat. While accelerating computation with FPGAs is the glamour play, every part of the computing system needs to be designed for the task or you’ll end up leaving performance on the table. And FPGAs can play a key role in accelerating each of these tasks. FPGAs cut their teeth in the networking business decades ago and have only more recently proven themselves in storage and compute.
On the compute front, Bittware offers HBM2-enabled FPGA devices from both Intel and Xilinx. HBM integration is new for both Intel and Xilinx and is a game-changing innovation that allows acceleration of applications that would otherwise be limited by the bandwidth of conventional discrete memory implementations. For example, Bittware’s 520N-MX is a full-height, double-width PCI-Express card that packs an Intel Stratix 10 MX FPGA, up to 8 GB of integrated HBM2 @ 512 GB/s, four QSFP28 cages supporting up to 100G per port, two DIMMs supporting DDR4 SDRAM, QDR-II+ SRAM, or Intel Optane 3D XPoint, two OCuLink ports for direct expansion to NVMe SSD arrays, and a “Board Management Controller” (BMC) for Intelligent Platform Management. If you’re doing 100G line-rate network packet processing or compute-intensive data center applications that demand high memory bandwidth, that packs a lot of punch.
Looking at the Xilinx side of the aisle, the XUPVVH is a 3/4-length PCIe board with a Xilinx Virtex UltraScale+ VU35P/VU37P with integrated 8 GB HBM2 @ 460 GB/s, a PCIe x16 interface supporting Gen1, Gen2, or Gen3, four QSFP cages for 4x 40/100GbE or 16x 10/25GbE, and up to 256 GB of DDR4. Because the Virtex’s 2.8 million logic elements can generate a lot of heat, Bittware relies on what they call their “Viper” platform, which uses computational flow simulation to drive the physical board design in a “thermals first” approach, including “the use of heat pipes, airflow channels, and arranging components to maximize the limited available airflow in a server.” Viper boards are passive by default, with active cooling as an option.
Beyond the capabilities and specs of the hardware, the team at Bittware has an incredible amount of experience getting real-world performance out of FPGA-based systems. Partitioning workloads, understanding memory and network bandwidth requirements, and managing thermals in complex data center installations are no small feats, and the expertise this team brings could be the difference between success and failure, or between an optimal system and an unbalanced compromise. With Bittware and Nallatech enjoying the larger footprint and resources of Molex, it will be interesting to see what kind of engagements they attract.
Yes, Kevin, the programming model for FPGAs is a big hurdle. In fact, it’s the focus of my startup. I’ve recently picked up a focus on cloud FPGAs, and I’m building momentum for an open-source project to make them more accessible. Regardless of your favorite h/w programming model, this project will take care of the interface between the s/w and h/w. A developer using this framework will just pass data structures between s/w and h/w using standard web protocols, without having to worry about transporting the bits. Ultimately, I want open-source h/w developed online (http://makerchip.com) and deployed and made available on cloud FPGAs at the push of a button (and a ~6-hr wait). I’m always looking for help. Here’s the project: https://github.com/alessandrocomodi/fpga-webserver.
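To make that concrete, here’s a rough sketch of what the software side could look like. The host, port, path, and JSON fields here are all invented for illustration; the actual project defines its own interface.

    // Hypothetical client: hand a data structure to an FPGA-backed
    // service over plain HTTP. The endpoint details are made up.
    #include <netdb.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <iostream>
    #include <string>

    int main() {
        // Payload the software side wants the hardware to process.
        const std::string body = R"({"op":"render","x":-0.5,"y":0.0,"zoom":3})";

        addrinfo hints{};
        addrinfo* res = nullptr;
        hints.ai_family = AF_INET;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("localhost", "8888", &hints, &res) != 0) return 1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) return 1;

        // Standard web plumbing: the caller never sees buses, bits, or DMA.
        std::string req =
            "POST /accel HTTP/1.1\r\n"
            "Host: localhost\r\n"
            "Content-Type: application/json\r\n"
            "Content-Length: " + std::to_string(body.size()) + "\r\n"
            "Connection: close\r\n\r\n" + body;
        send(fd, req.data(), req.size(), 0);

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            std::cout.write(buf, n);  // response computed on the FPGA side

        close(fd);
        freeaddrinfo(res);
        return 0;
    }

The point is that the developer writes ordinary web-client code, and the framework worries about getting the bits to and from the FPGA.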
The lack of movement in the ability to compile code for heterogeneous computing isn’t new; for me, it goes back to Inmos in the 1980s. Here’s my solution –
http://parallel.cc
Interestingly, Xilinx blew me off when I said I knew how to speed up the SystemC (C++) modeling approach they’re using for their latest AI/FPGA effort, but I suppose the digital guys are just addicted to clocks in their specifications and old-school C++ compilers. I hear Icarus Verilog outperforms their Verilog simulator, so maybe I should have been less surprised.
Thanks for a very nice article, Kevin. Glad to see that true editorial isn’t quite dead yet.
Bringing Bittware and Nallatech together is a good move. They have great respect for each other, and they are more than 2x better together than apart. I just came back from the Supercomputing show in Dallas, and there is a lot of FPGA activity. Many more people know what an FPGA is, and there is a wave of users starting to understand how to use them. There were tool makers in the Startup Pavilion, like LegUp with their C/C++-to-hardware tool. There were application vendors like Aclectic, which uses CPUs, GPUs, and FPGAs to accelerate each part of their rendering application to its fullest.
Big companies with lots of cash are investing in the space, and Intel/Altera guarantees that FPGAs will be supported in the HPC and Big Data spaces forever. Molex/Bittware/Nallatech were there, but so were Micron/Pico/Convey, each touting hardware and applications.
FPGAs are now the third leg of computing and they are in high end computing to stay.
The performance of FPGA accelerators and GPUs is primarily due to heterogeneous computing.
First, there are no instruction fetches. A CPU, by contrast, must fetch and execute a load and an arithmetic instruction for every new operand, plus fetch and execute a store instruction to put each result back to memory (see the sketch after this list).
Second, data can be streamed (DMA’d) from memory rather than going through the cache for each operand.
Third, the algorithm’s execution can be customized in the dataflow design, i.e., pipelined.
Even if the compiler supported parallel execution (multi-threading), only one thread can execute at a time on a given core, and there is overhead for synchronization, etc.
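To illustrate the first two points, here is a hedged sketch; the overheads in the comments are rough descriptions, not measured instruction counts:

    // Illustrative CPU inner loop: every element pays for instruction
    // fetch/decode plus a load and a store, on top of the one useful multiply.
    #include <vector>

    void scale(std::vector<float>& data, float k) {
        for (float& x : data) {   // per element, roughly: fetch the loop
            x = x * k;            // instructions, load x, multiply, store x.
        }                         // Only the multiply is "real work".
    }
    // An FPGA implementation would instead configure one multiplier and
    // DMA-stream `data` through it: no instruction fetches, no per-element
    // cache lookups, one result per clock once the pipeline fills.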
There is a major problem: the programmers developing algorithms are not equipped for the tedious and difficult tool chains that must be used to “program” an FPGA.
To “program” an FPGA, what really must be done is to design a new bitstream and load it into the device. That is the equivalent of designing a new chip with the same I/O pins, and programmers generally don’t care what an I/O pin is.
What is really needed is a design where the algorithm can be changed by loading a small control array that selects the appropriate operator for each stage of the pipeline. Enough of this C-to-HDL business: only expression evaluation is required for the data structure that is loaded, and the precedence of the operators in the algorithm must be handled.
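Here is a toy software model of that idea; all names and encodings are invented. Reloading the control array changes the algorithm, while the stage structure (the “hardware”) stays fixed:

    // Toy model of a fixed pipeline with selectable per-stage operators.
    #include <array>
    #include <initializer_list>
    #include <iostream>

    enum Op { ADD, MUL, MAX, NOP };

    constexpr int STAGES = 3;

    float apply(Op op, float acc, float operand) {
        switch (op) {
            case ADD: return acc + operand;
            case MUL: return acc * operand;
            case MAX: return acc > operand ? acc : operand;
            default:  return acc;  // NOP: pass the value through
        }
    }

    int main() {
        // "Control array": one opcode plus constant per stage,
        // here configured to compute (x * 2) + 5.
        std::array<Op, STAGES>    control  = {MUL, ADD, NOP};
        std::array<float, STAGES> constant = {2.0f, 5.0f, 0.0f};

        for (float x : {1.0f, 3.0f, 10.0f}) {      // streamed operands
            float acc = x;
            for (int s = 0; s < STAGES; ++s)       // each stage = one clock
                acc = apply(control[s], acc, constant[s]);
            std::cout << x << " -> " << acc << '\n';
        }
    }

In real hardware the stages would run concurrently on successive operands; the sequential loop here models only the function, not the timing. An expression compiler would fill the control array in evaluation order, which is where the operator-precedence handling comes in.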