How fast? Try 1.5 GHz.
Now, I know what you’re thinking…
“offer void where prohibited, professional stunt engineer on closed-course benchmark – do not attempt, your frequency may vary, dealer participation may affect final negotiated performance, operating frequency is not valid in North Dakota, Florida, Alaska, Hawaii, or at any location with ambient temperatures above 24C, performance graph not to scale, preliminary datasheet, results not typical…”
Well, before you go inventing all your own “fine print” for why a large-capacity FPGA fabricated on TSMC’s 65nm CMOS can’t possibly run at speeds past 1GHz on a real-world design, let me tell you that A) Speedster FPGAs really exist – I’ve touched one. They even have development boards that come in nice, shrink-wrapped boxes that look like you might be able to grab one off the shelf at Wal-Mart. B) Speedster has architectural features that make these performance claims viable, even though FPGA Journal Labs haven’t independently verified performance claims (by the way, FPGA Journal doesn’t yet have any labs, but if you’re interested in staffing one, give us a call.)
We’re not going to give a name to these architectural features just yet, though. You’ll have to read on – and no fair trying to jump to the end of the article. You see – we build to that.
The first sensible question to ask in evaluating speed claims on FPGAs might be “what prevents FPGAs from running any faster?” After all, maximum operating frequency is really just the point where the first thing in your chain of nice, synchronous logic operation breaks down. If we find out what that first problem is and eliminate it somehow, we should be able to run faster – until we find the second thing.
It’s already hard enough for an FPGA company to make even a datasheet specification for speed. If you have a logic path through your FPGA, a cloud of “delay” stacks up between each pair of registers. In the old days, most of that delay came from the logic elements themselves. That made the math pretty easy – you could just add up the delay for the number of levels of combinational logic between registers, lump in a little compensatory delay for routing, and you had a delay number that determined the fastest you could clock those registers expecting deterministic results.
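The "old days" arithmetic is simple enough to sketch in a few lines. This is only an illustration: the function name and the delay values are made up for the example, and it ignores register setup and clock-to-output times.

```python
def max_clock_mhz(logic_levels, lut_delay_ns, routing_margin_ns):
    """Estimate the fastest clock for a register-to-register path.

    Adds up one LUT delay per level of combinational logic plus a lumped
    routing allowance, then converts the total path delay to a frequency.
    All delay values are hypothetical, not taken from any datasheet.
    """
    path_delay_ns = logic_levels * lut_delay_ns + routing_margin_ns
    return 1000.0 / path_delay_ns  # a 1 ns period corresponds to 1000 MHz


# Four levels of logic at 0.5 ns each, plus 0.5 ns lumped in for routing.
estimate_mhz = max_clock_mhz(logic_levels=4, lut_delay_ns=0.5, routing_margin_ns=0.5)
print(estimate_mhz)
```

Once routing stops being a small, predictable add-on, that lumped margin is exactly the term you can no longer guess before place-and-route.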
As geometries got smaller, the routing delay started to take over from the logic. That meant you didn’t really have much of a picture of how much delay there would be through a path until after place-and-route had a go at it. A path that was nicely contained in one area might scream while another that was spread across the chip (or worse yet – stretched between two distant I/Os) might be really slow. If you’re in marketing at an FPGA company and you’re putting the maximum clock speed in the datasheet, what should you list – a single inverter located between two adjacent registers, or 24 levels of combinational logic stretching four times across the breadth of the device? Do you want to say that your FPGA runs at 600MHz, or 12?
Finally, with routing dominating the delay picture, the longest nets you have to route – and the ones with the largest fanout – are the clock lines themselves. When you try distributing a 500MHz clock over a large die, it should not surprise anyone that the edges don’t all arrive at the flops at the same time – or even close. This makes the timing picture even more complicated.
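To put a rough number on why skew hurts, here is a small sketch under a deliberately simple assumption: worst-case skew comes straight out of the clock period. The function name and the 0.5 ns skew figure are hypothetical, chosen only to make the arithmetic visible.

```python
def usable_budget_ns(clock_mhz, worst_case_skew_ns):
    """Time left for logic and routing once clock skew is paid for.

    Simplified model: the clock edge may arrive at the capturing flop up to
    worst_case_skew_ns early, so that much of each period is lost.
    """
    period_ns = 1000.0 / clock_mhz
    return period_ns - worst_case_skew_ns


# A 500 MHz clock has a 2 ns period; an assumed 0.5 ns of skew eats a quarter of it.
budget_ns = usable_budget_ns(clock_mhz=500.0, worst_case_skew_ns=0.5)
print(budget_ns)
```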
Achronix has developed a new architecture that makes many of those questions virtually disappear. You will still create your design in the usual way – developing RTL code, simulating it with the same simulators, and synthesizing it with industry-standard FPGA tools like Synopsys (formerly Synplicity) Synplify-Pro and Mentor Precision Synthesis. The logic of your design will be boiled down to a network of interconnected 4-input look up tables (LUTs) – nothing new there, either. Those 4-input LUTs are implemented in SRAM-type FPGA fabric on standard-process TSMC 65nm CMOS – still no departure from the modern-day FPGA norm.
Once your design is all simulated and synthesized and running through the vendor-supplied place-and-route tool, the magic starts behind the scenes. Achronix constructs what they call a picoPIPE. Achronix emphasizes that you don’t need to know the details of what’s going on behind the curtain, but we’re all engineers and we can’t possibly resist pulling back a good curtain. So – here we go.
The picoPIPE is a very fine-grained pipelining technique applied to your design – even the “just routing” parts between combinational elements. To visualize this, do a mental experiment – picture placing a register after every single delay-introducing element in a logic path. Did you just go through a LUT? Drop in a register. Did you just traverse a short routing path? Drop in another one. If you spaced registers every few picoseconds, you’d never have a long path between registers and you could run at a really high clock frequency.
“WAAAAAIT!” (I hear you out there.) “That would increase the latency of my design by an order of magnitude, and it would blow the synchronization of the arrival of data at all the places I care about – my REAL registers.” Calm down, young Skywalker. That’s not what we’re really doing – that’s why this is a thought experiment. “But, didn’t you just point out that getting the clock around the chip was the hard part? And by asking for a really fast clock to go to all these new registers, you’ve just made that problem much worse!” OK, I can see we’re not going to outsmart you guys easily.
Now, to continue the experiment, let’s take all these registers we’ve inserted and disconnect them from that clock network (the one we can’t build anyway). Instead, let’s connect each set of them with a differential pair and a third “acknowledge” signal. This allows us to operate these pipeline stages asynchronously without stretching a clock line to each of them. Now, to allow data to arrive at the “REAL” registers coherently when there are unequal-length pipelines along the way, we’ll have to do some magic synchronization, but with that out of the way, we’ve got a picoPIPE pretty much built.
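For the curtain-pullers, here is a toy model of a handshaked pipeline. It is emphatically not Achronix's implementation: it is a generic sketch in which each stage holds at most one token and passes it on only when the next stage is empty, which stands in for the "acknowledge" signal. The function name is invented, the differential-pair data encoding is not modeled, and the global "step" exists only to make the simulation easy to follow; real asynchronous logic has no such thing.

```python
def run_handshake_pipeline(tokens, num_stages):
    """Push tokens through a chain of single-token stages.

    A stage hands its token forward only when the downstream stage is empty
    (our stand-in for an acknowledge). Returns the tokens in arrival order
    and the number of simulation steps taken. Tokens must not be None,
    because None marks an empty stage.
    """
    stages = [None] * num_stages
    pending = list(tokens)
    out = []
    steps = 0
    while pending or any(s is not None for s in stages):
        # Drain the final stage first so the data behind it can advance.
        if stages[-1] is not None:
            out.append(stages[-1])
            stages[-1] = None
        # Walk from the output end toward the input, moving each token
        # forward only if the stage ahead of it is free.
        for i in range(num_stages - 1, 0, -1):
            if stages[i] is None and stages[i - 1] is not None:
                stages[i] = stages[i - 1]
                stages[i - 1] = None
        # Accept a new token at the input whenever the first stage is free.
        if pending and stages[0] is None:
            stages[0] = pending.pop(0)
        steps += 1
    return out, steps


words = ["a", "b", "c"]
received, steps = run_handshake_pipeline(words, num_stages=3)
print(received, steps)
```

No stage ever looks at a clock. Each one only cares whether its neighbor is ready, and once the pipe is full a token falls out of the far end on every step.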
What did all that accomplish?
We obviously haven’t improved the delay along the longest logic path – physics still works, darnit. We have, however, done away with the clock distribution problem. That’s nice. We’ve also made it so that we can clock in data at the rate of one of our tiny pipeline stages instead of at the rate of the longest combinational path through the circuit. All these data are filling the pipeline, giving us a much higher effective throughput – with similar latency to what we had before the whole experiment. Where we once clocked our “old school” FPGA at 375 MHz with data coming in at that same rate, we can now clock our Achronix Speedster with its picoPIPE at 1.5 GHz. Latency will be similar on the two implementations, but throughput and external clocking will be improved by something like 4X.
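The latency-versus-throughput distinction is easy to check with back-of-the-envelope arithmetic. In this sketch the 375 MHz and 1.5 GHz rates come from the discussion above, while the function name, the 20 ns latency, and the million-word stream are assumptions picked purely for illustration.

```python
def stream_time_ns(num_words, rate_mhz, latency_ns):
    """Time to push a stream of words through a pipelined path.

    The first word takes latency_ns to emerge; every later word follows
    one input period behind the previous one.
    """
    period_ns = 1000.0 / rate_mhz
    return latency_ns + (num_words - 1) * period_ns


NUM_WORDS = 1_000_000
LATENCY_NS = 20.0  # hypothetical, and assumed equal for both devices

conventional_ns = stream_time_ns(NUM_WORDS, 375.0, LATENCY_NS)
picopipe_ns = stream_time_ns(NUM_WORDS, 1500.0, LATENCY_NS)
speedup = conventional_ns / picopipe_ns
print(round(speedup, 3))
```

With equal latency on both sides, the first word out gains nothing; the speedup shows up only as the stream gets long, where it approaches the ratio of the two input rates.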
Obviously this technique will apply differently to different types of designs. Datapaths will benefit enormously while control-dominated logic and designs with considerable internal feedback may benefit less. Nonetheless, what we see at the inputs and outputs is an FPGA with 1.5 GHz performance. As a side benefit, that number should (according to our theory anyway) vary much less than it does in a typical FPGA architecture.
The Achronix Speedster family has four members ranging from about 24K LUTs up to 163K LUTs. Block RAM ranges from about 1.2 Mb on the smallest device up to 11.7 Mb on the biggie. Since the devices are likely to be popular for number-crunching datapath designs, a good stock of 18X18 multipliers is included – 50 on the smallest device up to 270 on the largest. Also, in order to take advantage of all that throughput, we need high-speed serial transceivers to pump data onto and off of the chip. Speedster includes up to 40 10.3 Gbps SerDes transceivers as well as DDR3/DDR2 controllers to connect to external memory. User I/O ranges from 488 to 933.
Achronix is already shipping the SPD60 – the third largest device in the family, along with a development board – the SDK60. The board can operate stand-alone or as a PCIe plug-in card. Achronix also supplies a full suite of development tools, including OEM versions of industry-standard synthesis tools tuned for the new technology. The company says volume pricing for the Speedster FPGA family ranges from under $200 to $2500.