Everything in technology changes – evolves – improves.
First, we had 8-bit processors, then 16, then 32… now many of us are tapping the keys on 64-bit devices.
Nothing stays still for very long.
Why, then, have we lived for around a decade with very little change to the garden-variety 18×18 multipliers in our hardened FPGA DSP blocks? Except for a few minor improvements, those haven’t really progressed in years.
To paraphrase something Bill Gates apparently never really said: “Why would anybody ever need more than 18×18 bit multiplication?”
OK, wait. There has been some evolution in DSP blocks. We’ve gone from 18×18 multipliers to multiplier-accumulator-ALU-ish blocks with all kinds of fancy carry logic. We’ve even got asymmetric multipliers that have a wider side to accommodate a few tougher problems. Both of the major vendors have continued to improve their blocks in ways that allow more complex operations to be done without jumping out to the LUT fabric.
Altera, however, has just raised the stakes a lot – with their complete re-design of the DSP block for their upcoming 28nm Stratix-V line.
For the tried-and-true sweet spot of FPGA, 18×18 multipliers were just fine, but with FPGA markets expanding into areas like medical imaging, wireless, mil/aero, and test and measurement, wider fixed- and floating-point multiplication is required to solve the real-world problems. If your FPGA can support those operations in hard-wired logic, you can skip the LUT fabric altogether, improve your throughput and power consumption, and save the programmable fabric for other work – or, better yet, save the money by buying a smaller FPGA.
The key new element in Altera’s DSP is variable precision. Instead of a fixed-width hardware multiplier, the company has introduced a fracturable/cascadable multiplier that can deliver a variety of bit widths very efficiently. To avoid making our eyes glaze over with the exhaustive list of every possible combination, we’ll just say that you can choose precision from 9×9 up to 54×54, including asymmetric settings, with very little wasted hardware. Floating-point mantissa multiplication is easily accomplished as well, so the enthusiasts of the relatively narrow area of FPGA-accelerated high-performance computing (or “reconfigurable computing”) will be very excited. (You see, OpenFPGA.org? Somebody IS listening.)
Back in the days of Stratix II, an Altera DSP block had four independent 18×18 multipliers (four 36-bit inputs). For Stratix III, the company doubled the block and made it splittable (four 72-bit inputs), so that we could use “Half Blocks.” The DSP block then could do eight 18×18 multiplications summing, or four 18×18 multiplications independently. Now, the DSP block has four of the new variable-precision blocks (four 72-bit inputs), so the unit can do eight 18×18 multiplications summing, eight 18×18 multiplications independently, and high precision operations.
The new block has two native modes – “18-bit” and “high-precision.” In “18-bit” mode, two 18×18 products can be summed into a 64-bit accumulator (with 37-bit precision out of the adder), or two 18×18 products can be independently output with 32-bit product precision. In “high-precision” mode, you can do 27×27 multiplication with a 64-bit accumulator and 18×36 with a 64-bit accumulator. Since a single-precision mantissa is 24 bits wide (counting the implicit leading one), this means you can do single-precision floating-point mantissa multiplication in one variable-precision DSP block. The 64-bit accumulators allow for cascading without loss of precision.
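To see why a 27×27 multiplier is enough for single precision, here is a minimal Python sketch. It is only a software model of the arithmetic, not of Altera’s hardware: the helper names `float32_fields` and `mantissa_product` are made up for this illustration, and rounding, exponent handling, and sign logic are deliberately left out.

```python
import struct

def float32_fields(x):
    """Round x to IEEE 754 single precision and return
    (sign, biased_exponent, significand) as integers."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    # Normal numbers have an implicit leading 1, giving 24 significant bits.
    significand = fraction | (1 << 23) if exponent != 0 else fraction
    return sign, exponent, significand

def mantissa_product(a, b):
    """Integer product of the two 24-bit significands (hypothetical helper).
    Each operand must fit in the 27-bit input of a high-precision multiplier."""
    _, _, ma = float32_fields(a)
    _, _, mb = float32_fields(b)
    assert ma < (1 << 27) and mb < (1 << 27)
    return ma * mb

product = mantissa_product(1.5, 2.5)
# 24 bits x 24 bits yields at most 48 bits, well inside a 64-bit accumulator.
print(product.bit_length())
```

The point of the sketch is the bit budget: each 24-bit significand fits a 27-bit operand with room to spare, and the product of at most 48 bits leaves headroom in a 64-bit accumulator.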
Altera lists many common applications where this all comes in handy. For example, FFTs require high-precision complex multiplication. The data width increases with each stage, while the coefficient remains the same, so we can go from 18×18 to 18×25 to 18×36. With the new architecture, each of these can be done in a single block. With previous-generation blocks, the number of DSP blocks required could double.
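The “could double” claim can be made concrete with a rough counting model, sketched below in Python. It assumes a wide product is simply tiled out of fixed 18×18 multipliers and ignores sign handling, the adders that combine partial products, and any vendor-specific packing, so treat the numbers as a back-of-the-envelope estimate. The function name `fixed_multipliers_needed` is hypothetical.

```python
def fixed_multipliers_needed(a_bits, b_bits, native=18):
    """Estimate how many native x native multipliers are needed to tile
    an a_bits x b_bits product (simple ceiling-based model)."""
    tiles_a = -(-a_bits // native)  # ceiling division
    tiles_b = -(-b_bits // native)
    return tiles_a * tiles_b

# The FFT widths mentioned above: data grows per stage, coefficients stay 18 bits.
for a_bits, b_bits in [(18, 18), (18, 25), (18, 36)]:
    count = fixed_multipliers_needed(a_bits, b_bits)
    print(f"{a_bits}x{b_bits}: {count} fixed 18x18 multiplier(s)")
```

Under this model, both 18×25 and 18×36 cost two fixed 18×18 multipliers, whereas the article describes each fitting in a single variable-precision block.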
For floating-point precision, using the 64-bit cascade, a single-precision mantissa multiplication can be done with one block at 27×27, or a double-precision 54×54 can be implemented with four blocks cascaded. Four blocks cascaded could do a single-precision floating-point FFT’s complex multiplication.
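The four-block figure for 54×54 follows from ordinary schoolbook decomposition: split each operand into two 27-bit halves and you get four 27×27 partial products. The Python sketch below models only that arithmetic for unsigned operands. How the real blocks route the partial sums through the 64-bit cascade is not described here, so the function `mul54_from_27` is an illustration under that assumption, not a description of the silicon.

```python
MASK27 = (1 << 27) - 1

def mul54_from_27(a, b):
    """Unsigned 54x54 multiply assembled from four 27x27 partial products."""
    assert 0 <= a < (1 << 54) and 0 <= b < (1 << 54)
    a_hi, a_lo = a >> 27, a & MASK27
    b_hi, b_lo = b >> 27, b & MASK27
    # One partial product per (hypothetical) 27x27 multiplier.
    hh = a_hi * b_hi
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    ll = a_lo * b_lo
    # Shift each partial product to its weight and sum.
    return (hh << 54) + ((hl + lh) << 27) + ll

# A double-precision significand is 53 bits, so it fits a 54-bit operand.
a = (1 << 53) - 1
b = (1 << 52) + 12345
assert mul54_from_27(a, b) == a * b
```

A complex multiplication has the same flavor: (a + jb)(c + jd) needs the four real products ac, bd, ad, and bc, which is where the four-block count for a single-precision complex multiply comes from.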
The combinations and permutations go on and on, of course. Altera looked at a number of critical popular applications in designing the new blocks, and the net effect is that you’ll use far fewer of the new blocks to accomplish the same math, and far less often be required to take your critical timing path out of the hardened world of your DSP blocks and into the LUT fabric.
Will this benefit you? The answer is – sometimes.
If you’re doing arithmetic operations that require more than the fixed-point precision choices available currently, you’ll certainly be able to do them with fewer DSP blocks, and with fewer excursions into the LUT fabric. That means you’ll have more options. If the number of DSP blocks was the reason you had to buy that bigger FPGA, you can now buy a smaller one. (Don’t tell Altera they just engineered themselves into a smaller sale.)
If you were pushing it on Fmax because some of your arithmetic logic was bleeding over into the LUTs, you may now be able to operate the multiplier-accumulator part of your datapath closer to the datasheet frequencies. Or, you may have a lot less work to do on timing closure when you’re finishing up your design.
If you were resource-sharing DSP blocks because you were limited in the number available, now you can go with more parallelism and potentially improve your throughput and/or latency. This would, of course, also mean fewer memory and register resources tied up in making the sharing magic work.
Another group certain to benefit from this architecture is designers using high-level synthesis to go from algorithmic representations in C/C++ or other untimed high-level languages into FPGA hardware. If your high-level synthesis tool has this more flexible block in its tool chest (and if it has the wherewithal to use it properly), you’ll magically get better results without even worrying about it.
As the marginal returns from each new process node continue to diminish, FPGA companies need to step up their architectural innovation to keep pace with our insatiable appetite for more performance and efficiency. Smart advances like this new DSP block are exactly what FPGA users need and exactly what the FPGA industry needs to keep attracting new customers and new design wins in an increasingly competitive environment.