It’s all coming together.
In response to a rapidly broadening market and increasingly niched and specialized competition, the world’s largest FPGA company has narrowed and focused. With the celebration of their 25th anniversary and the introduction of their new Virtex-6 and Spartan-6 lines, the company is demonstrating the principle of doing more with less – ironically solving a broader class of problems with a narrower range of solutions.
Of course, a lot of this consolidation actually takes place under the hood, where most designers may never notice. As the company has demonstrated in the past, once you have a solid, functional FPGA platform, it is a comparatively trivial matter to re-market it wearing the clothes of a number of specific, targeted areas. Venerable Virtex has already appeared in the space version, the automotive version, and wearing a number of other pseudo-disguises that make it appear ready-made to address the most pressing challenges of a number of valuable target application areas.
Last May, archrival Altera put Xilinx in the unenviable position of being a technology node behind on both of its principal platforms. When Altera announced the 40nm Stratix IV, they garnered a clean sweep of the nanometer count with Stratix IV trumping Virtex-5 (40nm vs 65nm) and Cyclone III besting Spartan-3 (65nm vs 90nm). While process node certainly isn’t the only thing that matters, it’s long been a linchpin for bragging rights in the semiconductor world. With these new announcements, Xilinx is hoping to gain back a significant chunk of that ground, matching Altera’s node on the high end and leapfrogging on the low end by skipping the 65nm Spartan family altogether. Now, Altera still likely has bragging rights on the high end, as Stratix IV has been shipping for a while now, but Xilinx is the first on the block with a 45nm low-cost family.
If you’re the type that’s interested in just the top level, you can stop reading now. That’s the end of the dubious marketing claims section.
Now let’s jump into the technical details, like how the V6/S6 duo is “Addressing the Programmable Imperative.”
Dear Programmable Imperative, OK, wait – sorry, that’s more marketing fluff.
Now, really, here are the technical details. Xilinx is continuing their multi-fab strategy, with the new lines being produced by UMC and Samsung. At first, it seems, Virtex-6 will be produced on UMC’s 40nm process while Spartan-6 will start on Samsung’s 45nm line. Samsung sneaked quietly into the Xilinx stable, with rumors surfacing a month or so ago that Xilinx was abandoning long-term partner UMC in favor of Samsung for their 40nm designs. These rumors turned out to be mostly false, as the two fabs are splitting the Virtex/Spartan pie. Fabrication strategy has long been a key difference between Xilinx and Altera, with Xilinx favoring a “keep your options open” multi-vendor strategy and Altera sticking steadfastly to a “bet it all on one horse” plan with key partner TSMC. The results of these contrasting approaches have varied over the years – Xilinx often having a better fallback plan if one company has difficulty or falls off the pace, and Altera recently benefiting from TSMC’s speed in bringing up the 40nm node.
Virtex-6 is based on UMC’s 40nm, 12-layer (11-metal, 1-poly) triple-oxide process. Triple oxide is a power reduction strategy first deployed in Virtex-4, where varying thicknesses of insulating gate oxide are used selectively to reduce leakage current in areas where maximum switching frequencies are not required. Spartan-6 is based on Samsung’s 45nm 9-layer dual-oxide process. Apparently Samsung (at 45nm) was ready for prime time ahead of UMC (at 40nm) so Spartan-6 is sampling now whereas Virtex-6 is set for Q2 samples. Both lines are expected to be in production by the end of 2009.
With the exception of the fabrication process, there is a lot in common between the two new families. The Virtex-6 and Spartan-6 platforms share a common architecture – right down to the LUT fabric. This is the beginning of the “grand unification” part we mentioned. For quite a while now, FPGA vendors have been using completely different architectures for their high-end and low-end families. For example, Virtex-5 used a 6-input look-up table (LUT) fabric, while the Spartan series through Spartan-3 was based on a 4-input LUT. The same holds over in Altera land, where Stratix has been based on the Altera ALM (a kinda’ six-input LUT), while Cyclone retained the LUT4.
Why is this important? IP re-use, portability, and predictability. If you have RTL-based IP and synthesize it for a LUT4 architecture, you’ll get completely different results from synthesizing for a LUT6 architecture. In LUT4, you’ll use more routing resources, have more nets in your design, use up more logic cells (and registers), have more levels of logic between registers, and so forth. Now, with both Xilinx families sharing the LUT6, you should see similar (but not necessarily identical) structural results when synthesizing the same core IP for Spartan and Virtex. This makes results more predictable for you, and engineering and support easier for Xilinx.
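To put rough numbers on the LUT4-versus-LUT6 difference, here is a back-of-the-envelope Python sketch. It estimates how many LUTs and logic levels a simple n-input reduction (a wide AND or XOR, say) needs when built as a tree of k-input LUTs. The function name and the tree model are ours, purely for illustration; real synthesis tools do a great deal more than this, so treat the output as a trend rather than a prediction.

```python
def reduction_tree_cost(n_inputs, lut_width):
    """Estimate (lut_count, logic_levels) for an n-input associative
    reduction (e.g. a wide AND or XOR) built as a tree of k-input LUTs.

    Hypothetical helper for illustration; not a model of any real tool.
    """
    if n_inputs < 1 or lut_width < 2:
        raise ValueError("need n_inputs >= 1 and lut_width >= 2")
    luts = 0
    levels = 0
    signals = n_inputs
    while signals > 1:
        # Pack the remaining signals into groups of lut_width.
        full, rest = divmod(signals, lut_width)
        # A leftover group of two or more signals needs its own LUT;
        # a single leftover signal just passes through to the next level.
        luts += full + (1 if rest > 1 else 0)
        signals = full + (1 if rest >= 1 else 0)
        levels += 1
    return luts, levels


if __name__ == "__main__":
    for width in (4, 6):
        luts, levels = reduction_tree_cost(32, width)
        print(f"32-input reduction with LUT{width}: {luts} LUTs, {levels} levels")
```

Under this simplified model, a 32-input reduction takes 11 LUTs and three levels of logic in a LUT4 fabric, versus 7 LUTs and two levels in a LUT6 fabric. Fewer levels means fewer nets between registers, which is where the routing savings come from.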
For years, the LUT4 was the gold standard in FPGAs. Research papers had unequivocally proven that the 4-input look-up table was more efficient at mapping a broad range of logic than any other width, so the problem was solved, right? Wrong. The assumptions behind the LUT4 research didn’t account for the current situation at very small process geometries. As we passed through about 90nm, the share of the delay and area caused by routing began to dwarf the contribution of the logic cells themselves. When wires become more expensive than components, the game changes. Wider LUTs that can map more complex combinational functions begin to have a significant advantage over narrower cells that require more interconnect. Today, both Xilinx and Altera have switched to wider, 6-input LUTs for their high-end families, and now Xilinx has brought the LUT6 to low-cost FPGAs as well. Beware, though – both companies still measure their fabric capacity in something like “equivalent LUT4s”. This is accomplished by taking the number of LUT6 cells and multiplying by the largest number that marketing can convince engineering is plausible. In the case of Xilinx, that magic number is 1.6. (For Altera, it is 2.5, which leads us to the conclusion that Altera has tougher marketing people.)
Still looking at the LUT fabric, there’s another curiosity in the new Xilinx architecture. This time, the number of flip-flops per LUT is effectively doubled. Each “slice” in the new families contains four six-input LUTs and eight flip-flops. In previous generations, there was a one-to-one mapping between LUTs and flip-flops. The doubling of registers brings the Xilinx architecture a bit closer to Altera’s semi-incomprehensible 8-input ALM (which is a LUT6 most of the time, except when it exercises super-powers and morphs into a limited LUT7, or splits into two LUT4s, or a LUT5 and a LUT3, or two LUT5s with 2 shared inputs, or, oh, sorry, got carried away there…) The thing to remember about all these exotic architectural marketing claims is that the logic structure is only as good as the synthesis tool you’re using. If the clever folks at Synopsys/Synplicity, Mentor Graphics, and – to a lesser degree – the synthesis groups at Xilinx and Altera haven’t fine-tuned their tools to take advantage of a novel architecture, you’ll never see a difference. Presumably, Xilinx saw synthesis tools achieve better utilization with the additional register, so we should probably all be happy it’s there.
The two new families provide a continuous range of fabric densities, with Spartan-6 offering from 2,104 to 92,160 LUT6 cells (3,366 to 147,456 marketing cells) and Virtex-6 weighing in with a daunting 46,560 to 474,240 LUT6 cells (74,496 to 758,784 marketing cells). This leaves about a two-device density overlap between the families. These numbers are approximately double the previous-generation Xilinx devices, and they may be larger (subject to intensive marketing debate, of course) than the corresponding Altera families, with Cyclone III boasting up to 119,088 actual 4-input logic cells (against Spartan-6’s estimated 147K) and Stratix IV’s largest device measuring 272,440 of the aforementioned ALMs, which are equated to 681,100 equivalent marketing cells (against Virtex-6’s 758K). Clear? Nah, not to us either.
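For anyone who wants to check the capacity arithmetic, the following Python sketch reproduces the Xilinx figures quoted here from the raw LUT6 counts, using the 1.6 multiplier and the four-LUTs, eight-flip-flops slice described earlier. The helper names are ours, and the slice and flip-flop totals assume every LUT sits in a fully populated slice, which is an assumption on our part rather than a datasheet figure.

```python
LUTS_PER_SLICE = 4    # from the article: four LUT6s per slice
FLOPS_PER_SLICE = 8   # from the article: eight flip-flops per slice


def marketing_cells(lut6_count, factor_tenths=16):
    """Scale a LUT6 count by the vendor multiplier, given in tenths
    (16 means 1.6x). Integer math avoids floating-point rounding;
    fractions are truncated."""
    return lut6_count * factor_tenths // 10


def slice_resources(lut6_count):
    """Return (slices, flip_flops), assuming fully populated slices."""
    slices = lut6_count // LUTS_PER_SLICE
    return slices, slices * FLOPS_PER_SLICE


if __name__ == "__main__":
    devices = [
        ("Spartan-6 smallest", 2_104),
        ("Spartan-6 largest", 92_160),
        ("Virtex-6 smallest", 46_560),
        ("Virtex-6 largest", 474_240),
    ]
    for name, luts in devices:
        slices, flops = slice_resources(luts)
        print(f"{name}: {luts:,} LUT6 -> {marketing_cells(luts):,} "
              f"marketing cells, {slices:,} slices, {flops:,} flip-flops")
```

The marketing-cell results line up with the quoted 3,366, 147,456, 74,496, and 758,784 figures, which at least confirms that the 1.6 multiplier is applied consistently.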
Moving out from the LUT fabric, both new families contain varying concentrations of dedicated DSP blocks. The Spartan-6 family uses the DSP48A1 block, which consists of an 18X18 multiplier, an adder, and an accumulator. Virtex-6 uses the more complex DSP48E1 block, which contains a 25X18 multiplier, an accumulator, and a single-instruction, multiple-data (SIMD) arithmetic unit, plus a pre-adder to improve performance. Spartan-6 devices contain from 4 to 182 DSP48A1s, while Virtex-6 devices contain from 288 to a whopping 2,016 DSP48E1 blocks. These numbers are approximately double those of the previous-generation Xilinx families, and a decent notch more than the 1,360 18X18 multipliers available in the largest announced Altera Stratix IV device.
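Why would a pre-adder improve performance? One common use (our illustration, not something Xilinx spelled out in this announcement) is the symmetric FIR filter, where mirrored taps share a coefficient. Adding the two samples before multiplying halves the number of multiplies. The Python sketch below shows the arithmetic with plain integers; it is not a model of the DSP48E1 itself, and the function names are hypothetical.

```python
def fir_direct(window, coeffs):
    """One output sample the straightforward way: one multiply per tap."""
    if len(window) != len(coeffs):
        raise ValueError("window and coeffs must be the same length")
    return sum(x * c for x, c in zip(window, coeffs))


def fir_preadd(window, coeffs):
    """Same output for symmetric coefficients, adding mirrored samples
    before multiplying. Returns (result, multiplies_used)."""
    n = len(coeffs)
    if len(window) != n:
        raise ValueError("window and coeffs must be the same length")
    if list(coeffs) != list(reversed(coeffs)):
        raise ValueError("coefficients must be symmetric")
    acc = 0
    multiplies = 0
    for i in range(n // 2):
        # Pre-add the two samples that share coefficient coeffs[i].
        acc += (window[i] + window[n - 1 - i]) * coeffs[i]
        multiplies += 1
    if n % 2:
        # Odd tap count: the centre tap has no partner.
        acc += window[n // 2] * coeffs[n // 2]
        multiplies += 1
    return acc, multiplies


if __name__ == "__main__":
    coeffs = [1, 3, 5, 3, 1]
    window = [2, -4, 6, 8, -1]
    result, multiplies = fir_preadd(window, coeffs)
    print(fir_direct(window, coeffs), result, multiplies)
```

For the five-tap example, both versions produce the same sample, but the pre-add version needs three multiplies instead of five. In hardware terms, that is fewer DSP blocks for the same filter.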
Still on the subject of embedded hard-IP, the big story these days is high-speed serial I/O. It’s everywhere… and by “everywhere” we mean even in the Spartan family. Spartan-6 comes in two flavors – LX and LXT, with the “T” representing, of course, “Transceivers” – of the multi-gigabit serial kind. Spartan-6 LXT represents Xilinx’s first foray into the low-cost FPGA with SerDes realm – joining Lattice Semiconductor and Altera – both of whom have offered relatively low-cost transceiver-havin’ FPGAs for a couple of years now. Spartan-6 LXT offers from 2 to 6 low-power GTP transceivers, each operating at up to 3.2 Gbps, and a dedicated PCI Express endpoint block. Compared with other market offerings, Lattice’s ECP2M contains up to 16 3.125Gbps transceivers, and Altera’s just-announced Arria II GX contains up to 16 3.75Gbps transceivers (but that’s the topic of next week’s article).
Virtex-6 brings in the Serious SerDes, however, with up to 36 GTX low-power transceivers (rated at up to 6.5 Gbps each) in the LXT and SXT versions and supposedly up to 64 GTH (definitely not low-power) 11.4 Gbps transceivers in the not-yet-in-the-product-table HXT family. In addition to the tons of transceivers, the announced devices sport up to four tri-mode (10/100/1000) Ethernet MACs and up to two PCIe interface blocks. Its primary SerDes competition, the just-yesterday-announced Altera Stratix IV GT (more on that next week also), offers up to 24 11.3 Gbps transceivers and up to 24 additional 6.5 Gbps units.
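To compare these transceiver line-ups on one axis, the rough Python sketch below multiplies channel counts by the quoted maximum line rates. It is a raw, per-direction upper bound only: it ignores encoding overhead and assumes every transceiver runs at its top rate, both of which are simplifications on our part. The dictionary and function names are ours, and the HXT entry rests on the "supposedly" above.

```python
def aggregate_gbps(groups):
    """Sum raw line rate over groups of (channel_count, rate_in_mbps).
    Rates are kept as integer Mbps so the total is exact before the
    final conversion to Gbps."""
    total_mbps = sum(count * rate_mbps for count, rate_mbps in groups)
    return total_mbps / 1000


# Maximum configurations as quoted in the article (rates in Mbps).
FAMILIES = {
    "Spartan-6 LXT": [(6, 3200)],
    "Lattice ECP2M": [(16, 3125)],
    "Altera Arria II GX": [(16, 3750)],
    "Virtex-6 LXT/SXT": [(36, 6500)],
    "Virtex-6 HXT (unconfirmed)": [(64, 11400)],
    "Altera Stratix IV GT": [(24, 11300), (24, 6500)],
}

if __name__ == "__main__":
    for name, groups in FAMILIES.items():
        print(f"{name}: {aggregate_gbps(groups):.1f} Gbps raw, per direction")
```

By this crude measure, the announced Virtex-6 LXT/SXT parts top out at 234 Gbps against 427.2 Gbps for Stratix IV GT, and the not-yet-listed HXT would reach 729.6 Gbps if the 64-transceiver figure holds.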
As is the case these days, a new process node no longer brings an across-the-board bounty of density, speed, and power. You have to choose. If you’re lucky, you can pick two of the three. If you’re not careful, however, you may get only one. Xilinx was careful. They accepted modest increases in speed in order to realize significant gains in density and power. The important speed metric these days is I/O bandwidth – which we’ve already discussed. Getting much less attention these days is Fmax for the fabric. This is actually a good trend for several reasons. First, most of the high-end FPGAs today can toggle the fabric fast enough for most applications. If you’ve got a killer app that needs bleeding-edge frequencies in the 1.5 GHz range – go talk to Achronix. For the rest of the world, the 300-600 MHz that you can theoretically squeeze out of the fabric for the last couple of generations of FPGAs isn’t usually the bottleneck. Yes, timing closure will still sometimes keep you awake for a couple of horrific weeks toward the end of your project, but that’s one of the special joys of an engineering career.
Power is another matter. While each process node is predicted to be the end of power progress as we know it, Xilinx continues to slice off major chunks of power consumption – including static power – with each new family. The company claims that total power consumption (static plus dynamic) is about half that of previous generations of Xilinx products. With power, your mileage will vary – a lot. Datasheet numbers have little bearing on the real world, so we’ll have to wait for some actual design results to see how Xilinx’s claims measure up.
Both Virtex-6 and Spartan-6 are now packaged with flip-chip technology, presumably because of signal integrity advantages over wire-bond packaging. Today, flip-chip is still more expensive than wire-bond, which is ironic. Wire-bond is significantly more complex to produce, less reliable, and offers lower signal integrity. We expect that flip-chip will gradually lose its price premium and most devices will be packaged that way. Spartan-6, however, still retains its I/O ring architecture (not going completely over to the column-based structure of Virtex), presumably for cost reasons.
Speaking of cost – there is not much information available just yet. Cost is an elusive specification at best – with many vendors giving unrealistic super-high-volume pricing in their press, others giving no pricing at all, and still others quoting small-volume distributor prices. Xilinx says Spartan-6 prices will range “from $3 – $54 in high volume” and Virtex-6 “from $57-$2100.”
Interestingly, the wide array of previous Virtex and Spartan families has now apparently been reduced to five – Spartan-6 LX and LXT, and Virtex-6 LXT, SXT, and HXT. The capabilities of these families, however, cover a much wider range of applications than the previous plethora ever did. Also of note, the embedded PowerPC is now nowhere to be found in the Xilinx line. Perhaps the market has spoken on the flexibility and utility of soft-core embedded processors compared with the rather rigid and unwieldy hard-wired sort.
Xilinx is taking a top-down view of applications of these new devices, presenting a marketing picture that starts with the new device families, adds tools and IP, continues with domain-specific development kits, and tops it all off with comprehensive reference designs for targeted applications. Using this complete stack of technology, engineers in these targeted areas may be able to start out at the 80% complete stage – with a working reference design on a development board that fairly closely resembles their final application. If you’re lucky enough to be developing one of those targeted products, your peers will envy you as they scratch out new VHDL and Verilog and cobble together collections of disparate IP for their non-mainstream designs. Still, this range of flexibility is what makes FPGAs among the most powerful and valuable devices on the market today. Perhaps there is something to that Programmable Imperative…