For system designers, Moore’s Law is a gravy train. Every couple of years, you get more gates, more speed, lower power consumption, and lower cost. For digital designers and tool developers, however, that gravy train is headed through the tunnel right at you. Every couple of years, you have more gates to design in less time, more complexity to overcome, and tougher verification problems. Your design tools are heavily impacted, too. The synthesis and place-and-route runs that took a few minutes on an old 200MHz Windows 98 laptop are now running for 24 hours on the latest multi-core, memory-laden, tricked-out machines.
Xilinx’s latest software release goes straight at that problem, acknowledging that in this day of platform-based design, IP re-use, hardware/software verification, and high-speed serial I/O, the toughest FPGA design challenge for most people is still basic timing closure from RTL to bitstream. Xilinx’s new ISE 9.1i includes two major enhancements: “SmartCompile,” to address timing closure on large designs, and some new power optimization capabilities to address the growing sensitivity to power consumption in today’s more FPGA-centric systems.
Xilinx tackled the runtime and productivity issue on two fronts: evolutionary progress in runtime and algorithm efficiency (boosted by faster computing platforms, of course), and a more revolutionary change in the form of incremental design capability.
Before we get to incrementality: Xilinx claims to have achieved a 2.5X average improvement in runtime. We here in Journal land have always hated “2.5X faster” as a way of talking about runtime improvements, so here is what that means, according to our super-secret “execution speed” decoder ring: the runtime is divided by 2.5, giving a 60% runtime reduction. Xilinx measures runtime over a suite of 100 “typical” customer designs on the same machine running the old and new versions of the software, then averages the deltas. Voila! A 60% runtime reduction on average becomes “2.5X faster.” (Ain’t marketing wonderful?) In fairness, a 60% reduction is monumental in software performance tuning, particularly in a product whose release number is in the 9.x range. Normally, the easy speed gains come back in releases 1.x and 2.x, when you still have plenty of stupid n-squared loops to clean up. Mature software has much less low-hanging fruit.
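If you like to check the decoder-ring math yourself, the arithmetic is trivially small. Here it is as a quick Python sketch; the baseline hours are made up for illustration, and the 2.5X figure is Xilinx's claim, not our measurement:

```python
# Decoder-ring math: "2.5X faster" means the new runtime is the old runtime divided by 2.5.
old_runtime_hours = 10.0   # hypothetical baseline full run
speedup = 2.5              # Xilinx's claimed average speedup

new_runtime_hours = old_runtime_hours / speedup          # 4.0 hours
reduction = 1 - new_runtime_hours / old_runtime_hours    # 0.6

print(f"new runtime: {new_runtime_hours:.1f} h, reduction: {reduction:.0%}")  # 4.0 h, 60%
```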
Since most design work (particularly the timing closure phase) involves iterative running of steps like synthesis and place-and-route, efficient, intelligent incremental design tools can effect a dramatic improvement in average iteration time. If you go into your design and change only one small section, you don’t want to wait around while all the other parts of your design are re-compiled exactly as they were before. You’d like only the new and changed sections to require recompilation.
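The underlying idea is the same change-detection trick build tools have used for years. Here is a bare-bones sketch in Python of what "only recompile what changed" means in principle; this is our illustration of the concept, not a peek inside Xilinx's implementation:

```python
import hashlib

def digest(source: str) -> str:
    """Fingerprint a block's source so we can tell whether it changed."""
    return hashlib.sha256(source.encode()).hexdigest()

def incremental_compile(blocks, prev_digests, prev_results, compile_fn):
    """Re-run the expensive step only for blocks whose source changed.

    blocks: dict of block name -> source text
    prev_digests, prev_results: state saved from the previous run
    compile_fn: the expensive step (synthesis, place-and-route, ...)
    """
    digests, results = {}, {}
    for name, source in blocks.items():
        digests[name] = digest(source)
        if prev_digests.get(name) == digests[name]:
            results[name] = prev_results[name]        # unchanged: reuse the old result
        else:
            results[name] = compile_fn(name, source)  # changed: pay the compile cost
    return digests, results
```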
All this incrementality sounds great in concept, of course. It’s in the real-world implementation that problems crop up. That’s where Xilinx has had to focus their energy in providing practical incremental design. The classic difficulties in incremental compilation include things like sub-optimal timing results caused by modified parts of the design introducing new critical timing paths, some of which could benefit from a re-placement or re-synthesis of untouched design blocks. Additionally, sometimes you have to rip up or move existing sections of a design to make way for the new, larger, or otherwise different modified sections. Managing this squishy situation is one of the core challenges of incremental design tools. Another challenge is overhead management. Often, the compute and storage overhead required to provide incremental design capability can cause slowdowns and inefficiencies that eat up the speed gains that incrementality is intended to provide.
Xilinx claims to have addressed these issues in developing their new “SmartCompile” technology. When you’ve already run your design once, you can make minor changes without requiring the software to do a complete re-implementation of the design from scratch. Besides improving runtimes, this locks down the timing on parts of the design where you’ve already completed timing closure; the old non-incremental process could sometimes blow the results from one section of the design while processing changes in another. Preserving the old results as much as possible between incremental iterations helps speed convergence. Overall, Xilinx claims another “2.5X speedup” from incrementality on subsequent runs. By our decoder-ring math again, that means you might save an average of 84% of the runtime on an incremental run with the new release versus a full run with the old release. Xilinx calls this a “6.25X faster” runtime.
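Stacking the two claims is where the 6.25X comes from; again, nothing but multiplication (the speedup figures are Xilinx's averages, the arithmetic is ours):

```python
base_speedup = 2.5          # new release vs. old release, full run
incremental_speedup = 2.5   # incremental run vs. full run on the new release

combined = base_speedup * incremental_speedup   # 6.25
runtime_fraction = 1 / combined                 # 0.16 of the old full-run time
savings = 1 - runtime_fraction                  # 0.84

print(f"combined speedup: {combined}X, runtime saved: {savings:.0%}")  # 6.25X, 84%
```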
Xilinx has also added a feature called “SmartPreview” that allows place-and-route to “pause” and “resume.” This lets you view intermediate results without waiting for the whole run, which is a big time saver if you discover a problem early instead of waiting for an overnighter to complete. SmartPreview allows you to create a bitstream to take into a part immediately for debug, preserve your latest results as a snapshot, abort the place-and-route process entirely, or move on to the next run of a multi-pass place-and-route process.
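The pause/resume machinery inside ISE is Xilinx's own, but the general pattern is a familiar one: a long-running job that periodically reaches a safe point, checks for a pause request, and snapshots its state so you can inspect or resume it later. A generic sketch of that pattern follows; nothing here is ISE's actual interface:

```python
import json, os

PAUSE_FLAG = "pause.requested"   # hypothetical: user creates this file to request a pause
SNAPSHOT = "snapshot.json"       # hypothetical checkpoint file

def long_running_job(total_passes=1000):
    # Resume from a previous snapshot if one exists.
    state = {"next_pass": 0, "cost": 1e9}
    if os.path.exists(SNAPSHOT):
        with open(SNAPSHOT) as f:
            state = json.load(f)

    for i in range(state["next_pass"], total_passes):
        state["cost"] *= 0.999          # stand-in for one optimization pass
        state["next_pass"] = i + 1

        if os.path.exists(PAUSE_FLAG):  # safe point: honor the pause request
            with open(SNAPSHOT, "w") as f:
                json.dump(state, f)
            print(f"paused after pass {i}; intermediate state saved for inspection")
            return state                # rerun the job later to resume from the snapshot

    return state
```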
Finally, the new “SmartCompile” boasts a feature called “SmartGuide” that attempts to minimize the change between iterations, reducing the timing perturbation and runtimes for the small design changes typically encountered late in the design cycle. SmartGuide is a pushbutton algorithm that compares the new and old versions of a design, uses the original design as a guide, and incrementally places the new, changed, and critical elements. It then identifies critical timing paths and incrementally routes the new and critical paths to meet timing, arriving at a final implementation. Furthermore, you can manually identify partitions if you want to exercise more control, which is particularly useful for situations like team-based design, where multiple engineers may be working on a single FPGA, each at a different phase of implementation.
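Reading between the lines, this is the classic guide-based incremental flow. Here is a toy illustration of the idea in Python; every name and data structure here is ours, invented for illustration, and the real tools obviously operate on placed-and-routed netlists rather than dictionaries:

```python
def guide_based_placement(new_design, old_design, old_placement, place_block):
    """Blocks unchanged since the guide run inherit their old placement; only new or
    modified blocks get (re)placed. A real flow would follow this with a routing step
    that reroutes only the nets touching changed blocks plus any timing-critical nets."""
    placement, to_place = {}, []
    for name, contents in new_design.items():
        if old_design.get(name) == contents and name in old_placement:
            placement[name] = old_placement[name]       # reuse the guide's result
        else:
            to_place.append(name)
    for name in to_place:
        placement[name] = place_block(name, placement)  # place only the delta
    return placement

# Toy usage: only block "fifo" changed, so only "fifo" is re-placed.
old = {"adder": "adder_v1", "fifo": "fifo_v1"}
new = {"adder": "adder_v1", "fifo": "fifo_v2"}
old_place = {"adder": (0, 0), "fifo": (3, 5)}
print(guide_based_placement(new, old, old_place,
                            place_block=lambda n, fixed: (len(fixed), len(fixed))))
# -> {'adder': (0, 0), 'fifo': (1, 1)}
```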
Xilinx has thrown a few more convenience and productivity features into the release, including a Tcl console to allow scripting, the hooks necessary to integrate a variety of source code management systems into your ISE design flow, and an expanded timing closure environment that brings the various timing closure tools together into one user interface.
The second major challenge tackled by the new ISE is power optimization. Power in FPGA designs has only recently become a first-class concern. FPGA users of old just took whatever power consumption they got, plugged in bigger power supplies and fans if needed, and took it all as an excuse for the occasional marshmallow roast over their development boards. Today, however, many designers actually care how much power their FPGA design will burn. FPGAs are becoming more central to the system, larger, and faster. All of those factors push them up the list of most-scrutinized components when it comes to suspicion of power mongering.
First on Xilinx’s list of chores was to improve the accuracy and timeliness of power estimation. Many design projects have ended up far down the implementation trail only to discover that they were impossibly far over their power budget. Early estimation is the only way to gain confidence that you’re headed toward a workable solution from a power perspective. Xilinx has included new power estimation spreadsheets in ISE that help you get a rough idea of the power picture early on.
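The spreadsheet approach is essentially back-of-the-envelope bookkeeping: count up your resources, assume clock frequencies and toggle rates, and sum the familiar CV²f-style dynamic power terms. Here is a rough sketch of that arithmetic in Python, with every coefficient invented for illustration (these are not Xilinx's device models):

```python
# Back-of-the-envelope dynamic power: P ≈ sum(count * C_eff * V^2 * f * toggle_rate)
VCCINT = 1.0  # volts (hypothetical core supply)

# resource group: (count, effective capacitance in farads, clock in Hz, avg toggle rate)
resources = {
    "logic_cells": (20_000, 40e-15, 150e6, 0.125),
    "block_rams":  (48,     10e-12, 150e6, 0.50),
    "dsp_slices":  (32,      5e-12, 150e6, 0.25),
}

p_dyn = sum(n * c * VCCINT**2 * f * a for n, c, f, a in resources.values())
print(f"estimated dynamic power: {p_dyn * 1e3:.0f} mW")  # ~57 mW with these made-up numbers
```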
Once you get into your design process, you want to cut power consumption as much as possible. Xilinx has added new power optimization in both synthesis and place-and-route that they claim automatically reduces dynamic power by an average of 10%. Power consumption is highly design- and stimulus-dependent, however, so don’t be surprised if you see a wide variation in your results.
ISE 9.1i is available immediately from Xilinx, and additional packages, such as the 9.1i version of ChipScope, have already been announced.