Perhaps when the most important problem is a nail, every solution starts to look like a hammer. With the ramping explosion in AI and machine learning, countless companies are trying to climb on the bandwagon, morphing and melding their existing technologies in an attempt to come up with a differentiated solution that will capture a meaningful share of this mind-boggling emerging opportunity. Everybody from EDA vendors to cloud data centers to GPU companies, FPGA companies, IP companies, and boutique semiconductor startups is spinning stories about how their technology is the key to unlocking the potential of AI.
Most of them will fail.
Achronix, however, looks like a pretty strong contender. This week, the company unveiled its fourth-generation Speedcore eFPGA technology, targeting 7nm CMOS. While this new IP continues the mission of allowing FPGA fabric to be part of any SoC/ASIC design, the latest version has numerous features aimed specifically at accelerating machine learning inferencing. For those with the expertise and resources to do custom chip design, Achronix offers a compelling alternative to multi-chip solutions built on discrete FPGAs or other custom AI accelerators.
The Speedcore Gen4 embedded FPGA (eFPGA) IP for integration into customers’ SoCs increases performance by 60% over previous generations, reduces power by 50%, and decreases die area by 65%. These are significantly better PPA gains than one would get simply by moving to the next process node, and major architectural improvements are at the heart of them. Achronix says it is focusing on bringing “programmable hardware-acceleration capabilities to a broad range of compute, networking, and storage systems for interface protocol bridging/switching, algorithmic acceleration, and packet processing applications.”
Speedcore essentially lets you build an FPGA to your exact specifications. Modern FPGAs contain LUT fabric, multipliers/DSP blocks, and embedded memories at a minimum, and they often include processor cores and other hard blocks as well. A stand-alone FPGA is therefore always a “guess” by the FPGA company about the relative amounts of each of these resources needed for a given broad class of applications. When you select a stand-alone FPGA, it’s always some kind of compromise. You may have to take an FPGA with more multipliers than you need in order to get the amount of LUT fabric or high-speed IO your design requires. There is practically never a situation where the FPGA has exactly the mix of resources your application calls for.
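To make that compromise concrete, here is a toy sketch in Python. The device table and requirement numbers below are invented for illustration (they are not real parts); the sketch picks the smallest stock FPGA that covers a design's needs and reports how much of each resource goes unused:

```python
# Toy illustration: choosing a stock FPGA always strands resources.
# The device table and requirements are hypothetical, not real parts.
requirements = {"luts": 80_000, "dsp": 150, "bram_kb": 4_000}

stock_fpgas = [
    {"name": "midrange-A", "luts": 60_000,  "dsp": 288, "bram_kb": 3_200},
    {"name": "midrange-B", "luts": 100_000, "dsp": 480, "bram_kb": 6_400},
    {"name": "large-C",    "luts": 200_000, "dsp": 960, "bram_kb": 12_800},
]

# Smallest device that satisfies every requirement.
fits = [d for d in stock_fpgas
        if all(d[k] >= v for k, v in requirements.items())]
choice = min(fits, key=lambda d: d["luts"])

for k, need in requirements.items():
    have = choice[k]
    print(f"{choice['name']} {k}: need {need}, get {have} "
          f"({100 * (have - need) / have:.0f}% stranded)")
```

Run it and the mid-range device that covers the LUT requirement still leaves most of its DSP blocks idle, which is exactly the waste an eFPGA avoids.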
eFPGAs like Speedcore allow you to tailor the mix of resources exactly to your anticipated application needs. And, since you’re designing your own SoC anyway, you have the flexibility to merge the FPGA core with any number of other hard resources and IO. None of that changes with this new version of Speedcore, of course. But, with this edition, there are more (and more interesting) options for the types of blocks you can include in your implementation. The capabilities of those new blocks, plus (of course) the Moore’s Law gains from dropping to a 7nm process, make this a powerful new offering for those seeking to accelerate critical applications – particularly if those applications involve AI and machine learning.
Achronix has added what it calls Machine Learning Processor (MLP) blocks to the library of available blocks for building your eFPGA implementation. The company claims the new block delivers 300% higher system performance for artificial intelligence and machine learning (AI/ML) applications. These MLP blocks are aimed at the kind of matrix-multiply operations common in CNN inferencing. Each MLP includes a local cyclical register file that leverages temporal locality for optimal reuse of stored weights or data. The MLPs are tightly coupled with neighboring MLP blocks and larger embedded memory blocks, and they support multiple fixed-point and floating-point precision formats, including bfloat16, 16-bit half-precision floating point, 24-bit floating point, and block floating point (BFP). In many ML applications, reducing the precision of these calculations can yield massive gains in performance and power consumption with very little loss in accuracy. By supporting a wide range of precisions, the MLP allows you to find the optimal compromise between performance and accuracy for your application.
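To see why reduced precision is such a good trade for inference, here's a minimal NumPy sketch (not Achronix code) that emulates bfloat16 by truncating float32 mantissas and compares the result of a matrix multiply against full precision:

```python
import numpy as np

# Emulate bfloat16 by keeping only the top 16 bits of a float32
# (1 sign, 8 exponent, 7 mantissa bits) -- a rough software model
# of one of the reduced-precision formats the MLP supports.
def to_bfloat16(x):
    x = np.ascontiguousarray(x, dtype=np.float32)
    bits = x.view(np.uint32) & 0xFFFF0000  # drop low 16 mantissa bits
    return bits.view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
w = rng.standard_normal((64, 64)).astype(np.float32)

exact = a @ w
approx = to_bfloat16(a) @ to_bfloat16(w)

rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative error with bfloat16 inputs: {rel_err:.3%}")
```

Halving the storage and datapath width costs a fraction of a percent of accuracy on this random example, which is the kind of trade the MLP's precision options let you dial in per application.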
Other architectural changes and improvements include a new 8-1 mux, which allows up to 8-wide muxing in a single level of logic. Also new is an 8-bit ALU with twice the adder density of the previous generation; it is aimed at AI/ML applications, where adders, counters, and comparators are in constant demand. There is also a new 8-bit cascadable bus-maximum function, new high-efficiency dedicated shift registers, and a new 6-input LUT with two registers per LUT. Taken together, these should substantially improve throughput and architectural efficiency, provided the Achronix tool chain (and synthesis in particular) can take optimal advantage of the new and changed resources.
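As a toy behavioral model (plain Python, nothing Achronix-specific) of why the hard mux matters: a k-input LUT is just a 2^k-entry truth table, so any function of up to 6 inputs fits in a single 6-LUT, but an 8:1 mux has 11 inputs (8 data plus 3 select) and would otherwise need two levels of LUTs:

```python
# A k-input LUT is a 2**k-entry truth table: any Boolean function
# of k inputs fits in one LUT. An 8:1 mux has 11 inputs (8 data +
# 3 select), so it cannot fit in a single 6-input LUT -- which is
# why a hard, single-level 8-1 mux saves a logic level.
def make_lut(func, k):
    """Precompute a k-input Boolean function as a 2**k-entry table."""
    table = [func(*(((i >> b) & 1) for b in range(k))) for i in range(2 ** k)]
    def lut(*inputs):
        idx = sum(bit << b for b, bit in enumerate(inputs))
        return table[idx]
    return lut

# A 4:1 mux (4 data + 2 select = 6 inputs) does fit in one 6-LUT.
def mux4(d0, d1, d2, d3, s0, s1):
    return (d0, d1, d2, d3)[s0 + 2 * s1]

lut = make_lut(mux4, 6)
assert lut(0, 1, 0, 0, 1, 0) == 1  # selects data input 1
print("4:1 mux fits in a single 6-LUT; an 8:1 mux (11 inputs) cannot.")
```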
Achronix has also added a new independent dedicated bus-routing structure to the architecture, allowing bus-grouped routing separate from the normal bit-wise routing channels. This should minimize congestion as well as improve timing by providing matched-length connections for all bits in a bus. The company says these structures should be optimal for busses running between memories and MLPs, and they effectively create a giant, run-time-configurable switching network on-chip. Cascadable 4-to-1 bus routing provides 2x performance for busses while saving LUT resources.
Architectural improvements in the new Speedcore also allow LUT-based multipliers to be implemented more efficiently: a 6×6 multiplier can be built from only 11 LUTs and operates at 1 GHz, where a typical FPGA implementation would require 21 LUTs for the same function.
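For reference, here's what that multiplier block computes, as a shift-and-add behavioral sketch in Python (this models only the function; the 11-LUT mapping is a property of the Speedcore fabric, not of this code):

```python
# Behavioral model of a 6x6 unsigned multiplier: two 6-bit inputs
# produce a 12-bit product by summing shifted partial products.
def mul6x6(a, b):
    assert 0 <= a < 64 and 0 <= b < 64
    product = 0
    for bit in range(6):
        if (b >> bit) & 1:       # partial product for this bit of b
            product += a << bit
    return product               # max 63 * 63 = 3969, fits in 12 bits

# Exhaustive check against Python's native multiply.
assert all(mul6x6(a, b) == a * b for a in range(64) for b in range(64))
```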
The new resources are organized on the chip using non-traditional column adjacency, which Achronix says doubles the density of compute operations by providing cascade routing between the new MLP blocks and embedded memory blocks. This dataflow is optimized for AI/ML applications and should also yield significant power savings on those operations, because less power is consumed moving data between compute and memory resources.
Achronix uses its Speedcore Builder tool to create custom Speedcore instances to match each user’s requirements. The user can then evaluate the suitability of the generated eFPGA block for their application, and Achronix can supply die size and power information as well. This allows design teams to have a solid understanding of the functional applicability, performance, and power consumption of their eFPGA implementations long before they commit to silicon.
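The article's description doesn't include Speedcore Builder's actual input format, but conceptually the flow starts from a resource-mix specification. Here is a purely hypothetical sketch of such a specification (every field name below is invented for illustration and is not the real tool's syntax):

```python
# Purely hypothetical sketch of the kind of resource-mix request a
# tool like Speedcore Builder consumes; the real input format is not
# described here, and all field names are invented for illustration.
speedcore_request = {
    "process":     "tsmc7",   # target node
    "lut6":        50_000,    # 6-input LUTs (two registers each)
    "mlp_blocks":  128,       # Machine Learning Processor blocks
    "bram_blocks": 256,       # embedded memory blocks
    "bus_routing": True,      # include the dedicated bus network
}

# The vendor flow returns a generated instance plus die-size and
# power estimates, so the team can judge fit before taping out.
print("requested eFPGA instance:", speedcore_request)
```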
Achronix says Speedcore Gen4 for TSMC 7nm CMOS is available today and will be in production in 1H 2019. The company will then back-port Speedcore Gen4 for TSMC 16nm and 12nm with availability in 2H 2019.
Speedcore Gen4 should bring impressive levels of FPGA and AI/ML acceleration capability to many applications, and it could save dramatically on system cost, power, and complexity compared with solutions that use stand-alone FPGAs. With the expected dramatic growth in the market for AI/ML acceleration, we also expect to see third parties developing commercial specialized accelerator chips based on the Achronix IP. It will be interesting to watch the evolution of this market as design teams size up the various competing alternatives for compute acceleration in this exciting new domain.