Introduction
Prototyping an ASIC, ASSP, or SoC onto a single FPGA is not without its challenges. You have to deal with differences between ASIC and FPGA architectures, optimize for performance and area requirements, and plan a debug strategy. Unfortunately, this is only the tip of the iceberg when implementing an ASIC on a multi-FPGA platform. Currently, the largest FPGAs have a capacity of roughly 1.5 M equivalent ASIC gates, so when prototyping a chip larger than this, a multi-FPGA strategy must be in place, and several more pitfalls must be accounted for.
And yet it is well worth the effort. Over the years, FPGA prototyping has proven indispensable for functional verification and early software integration. With mask costs approaching $3M for 45nm designs, avoiding a re-spin by prototyping with FPGAs is a small price to pay, even if it means a minor deviation from the final ASIC environment (e.g., clocking, memories, and speed). The larger the design, the greater its development and manufacturing costs, and designs beyond single-FPGA capacity must be partitioned across several FPGAs if they are to be prototyped. It comes as no surprise that for multi-FPGA prototyping, a little pre-planning can go a long way.
Synthesis for ASIC Prototyping
First, there are the fundamental technologies that any prototyping flow needs in place, whether for a single- or multiple-device platform. These capabilities must account for standard ASIC-like constructs such as gated clocks and Synopsys DesignWare components.
Clock gating, while necessary in the ASIC world to conserve power in portable devices, can lead to poor results in FPGAs. Hence, conversion of these gated clocks to their FPGA functional equivalents is recommended. Most clock nets in an FPGA should be mapped to high-speed, low-skew clock lines. Nets directly driving sequential elements are typically routed this way, but when clocks are gated, they are taken off these high-speed routes. The resulting implementation leads to poor performance and potential setup and hold-time violations. FPGA synthesis should be able to convert these gated clocks to functionally equivalent logic, such as using the enable pin available on most sequential elements.
While the figure below demonstrates a simple gated clock, it is worth noting that not all gated clocks are created equal. Gated clocking schemes can be extensive, resulting in multi-level logic that drives not just simple registers but memory and DSP blocks. Clock divider circuitry, such as a simple counter, is another example where a clock is taken off the clock line and may suffer severe skew. A synthesis flow should be able to handle these structures as well.
Figure 1: Automated gated clock conversion in synthesis
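To make the conversion in Figure 1 concrete, here is a minimal Verilog sketch (hypothetical module and signal names, not actual tool output) of a simple gated clock and its FPGA-friendly equivalent using the flip-flop's clock-enable input. The equivalence assumes the enable is glitch-free with respect to the clock, as it would be with a standard ASIC clock-gating cell.

```verilog
// Original ASIC-style gated clock: the AND gate takes clk off the
// FPGA's dedicated low-skew clock network.
module gated_clk_orig (
    input  wire       clk,
    input  wire       enable,
    input  wire [7:0] d,
    output reg  [7:0] q
);
    wire gclk = clk & enable;   // gated clock on general routing
    always @(posedge gclk)
        q <= d;
endmodule

// FPGA-friendly equivalent: the register stays on the global clock and
// the gating condition moves to the flip-flop's clock-enable input.
module gated_clk_conv (
    input  wire       clk,
    input  wire       enable,
    input  wire [7:0] d,
    output reg  [7:0] q
);
    always @(posedge clk)
        if (enable)
            q <= d;
endmodule
```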
In addition to gated clocks, ASIC designs commonly use Synopsys DesignWare library components, such as datapath elements, memories, and FIFO controllers. A flow that provides support for such components ensures a transparent migration of designs that use them.
Performance is typically important for all FPGA projects, and ASIC prototyping is no exception. More often than not, the prototype is not expected to run at actual ASIC speeds, but it must be fast enough to handle real-time input, communicate with an external interface, or at least provide a verification environment several times faster than simulation. In such cases, synthesis needs all the usual optimization capabilities, such as advanced technology inference, retiming, resource sharing, and easy control of resource allocation.
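As a simple illustration of resource sharing (a hypothetical snippet, not taken from any particular design), the code below naively implies two adders feeding a mux; when resource sharing is enabled, synthesis can instead build a single adder whose inputs are selected by the condition.

```verilog
// Written this way, the expression implies two 16-bit adders and a mux.
// With resource sharing, synthesis can implement one adder with
// multiplexed inputs, trading a small mux cost for significant area.
module shared_add (
    input  wire        sel,
    input  wire [15:0] a, b, c, d,
    output wire [15:0] y
);
    assign y = sel ? (a + b) : (c + d);
endmodule
```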
One of the more critical performance optimizations in today’s complex designs is physical synthesis: the ability to use physical characteristics of the target FPGA to improve frequency. While regular RTL synthesis takes into account only logic cell delays and simple timing models of interconnect delays, physical synthesis also considers where the logic may actually be placed on the device and uses advanced delay models of the routing resources. With this information it produces a netlist optimized for performance. This is particularly useful for the high-end devices typically used in ASIC prototyping. And because high-end FPGAs are available from different vendors, broad device support for physical synthesis is critical in order to keep options open when finding the most suitable FPGA for prototyping.
Up-Front Partitioning Strategy
Once you have ensured that synthesis can support a basic prototyping flow, your next hurdle is devising a partitioning strategy. How carefully you address this affects the prototype’s system performance, the cost of the hardware, and the time spent on manual intervention.
First and foremost, try to consider the “high level” logic partitions at the front end of the design cycle. As you might guess, this is easier said than done, particularly with large teams where the designers and verification engineers may be at different sites or not working closely together. Experience has shown that SoCs designed with prototyping in mind can achieve much higher system performance than designs where prototyping is an afterthought. With or without this advantage, however, much can still be done at the back end.
Sizing Up the ASIC
It is important to get a feel, sooner rather than later, for the minimum number of required FPGAs and the interconnect structure. This gives an idea of the task that lies ahead. Automated or semi-automated partitioning software can be of immense help in this exploratory phase. With such software, the process can be as straightforward as importing the Verilog, VHDL, SystemVerilog, and post-synthesis design files (or any combination of these) and letting the tool perform an accurate gate-level estimate using encapsulated, bottom-up synthesis. Older-generation RTL partitioning software limited itself to rough area estimates through pure RTL analysis, thereby ignoring gate-level details. Next-generation tools have moved beyond this by performing full up-front synthesis to extract accurate timing and area data. The integration and certification between the partitioning software and the synthesis tool is critical at this stage.
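As a purely illustrative back-of-the-envelope check (the figures here are assumptions, not from any specific tool or device): a 12 M-gate ASIC targeted at 1.5 M-gate FPGAs that can realistically be filled to about 60% leaves roughly 0.9 M usable gates per device, so at least ceil(12 / 0.9) = 14 FPGAs are needed before I/O and interconnect constraints are even considered. An automated partitioner refines this first-order estimate with real gate-level and connectivity data.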
Hardware Independence
Accurate gate-level data is useful when it is time to shop around for the best hardware, whether you are looking at off-the-shelf solutions or considering rolling your own PCB.
You should consider not only the number of FPGAs and the capacity of each, but also the connectivity breakdown. Off-the-shelf solutions tend to have specific FPGA interconnect layouts; some interconnects are flexible (via cables or re-programmable crossbars) while others are fixed. Perhaps your interconnect requirements are so specialized that you have to go to a custom PCB. You can investigate this with either a fully automated or a semi-automated partition flow. In a purely software-guided flow, the tool should be able to manage the partition to minimize the number of FPGAs and/or maximize performance. Alternatively, you should be able to semi-manually investigate various partitioning possibilities through the use of an impact table, where graphically dragging and dropping logic blocks into different partitions helps develop a potential interconnectivity map, after which you let the software complete the partition based on your settings. Either of these methods allows you to gain a high-level view of your hardware requirements.
Hardware, in this case, not only means the PCB layout but also the FPGAs themselves. When comparing FPGA vendors, perhaps you have found certain features of one FPGA family more suitable for your application than the other. Understanding your requirements and having the flexibility of vendor independence before selecting your hardware can prevent buyer’s remorse later in the implementation phase.
Timing-Based, Hierarchical Partitioning
Full bottom-up synthesis is important not only for accurate gate-level estimation, but also for the accurate timing analysis needed to achieve a partition optimized for performance. High operating speeds cannot be achieved without careful analysis of the timing paths. For large ASIC designs, hierarchical analysis, partitioning, and optimization are a must in order to keep the netlist database to a manageable size. The alternative is “flattening” the design: effectively reducing the netlist to one “module” by creating a uniquely named copy of each lower-level module definition every time it is reused. This approach not only leads to long run times but may not even be feasible for extremely large ASIC designs. A hierarchical approach yields a smaller database with shorter run times and optimizes the entire design according to its original hierarchy, allowing for effective clock and data-path analysis.
Clock Analysis & Optimization
One element of timing-based partitioning is the distribution of clocks. The skew of a multiple-FPGA system comes from the combination of skew on the board and skew inside the FPGAs. When a clock is generated inside one FPGA and distributed to others, the board skew for each receiving FPGA has to be well balanced, otherwise it will cause hold-time violations. A loop-back structure such as the one in the following figure is necessary if the clock-generating FPGA also receives the same clock.
Figure 2: Clock loop-back to avoid clock skew on board
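A minimal Verilog sketch of the loop-back idea in Figure 2 is shown below; the port and signal names are hypothetical, and a real design would typically use the FPGA's PLL/DLL resources and dedicated clock pins. The point is that the generating FPGA drives the clock onto the board but clocks its own logic from the fed-back copy, so it sees the same board delay as the receiving FPGAs.

```verilog
// Clock-generating FPGA: clk_src is created internally (a simple
// divide-by-two here, purely for illustration), driven off-chip on
// clk_out, and returned on clk_fb through a board trace matched to the
// traces feeding the other FPGAs.
module clk_master (
    input  wire       clk_ref,   // reference clock from an oscillator
    input  wire       clk_fb,    // looped-back copy of clk_out
    input  wire [7:0] d,
    output wire       clk_out,   // distributed to all FPGAs on the board
    output reg  [7:0] q
);
    reg clk_src = 1'b0;
    always @(posedge clk_ref)
        clk_src <= ~clk_src;

    assign clk_out = clk_src;

    // Internal registers use the fed-back clock, not clk_src directly,
    // so their launch edges align with the clocks seen by other FPGAs.
    always @(posedge clk_fb)
        q <= d;
endmodule
```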
Other alternatives to eliminate board skew include isolation or duplication of the clock circuitry as shown below.
Figure 3: Clock isolation to avoid board clock-skew
Given the limited clock distribution resources on a typical PCB, multiple schemes may co-exist to ensure minimal board skew. A well-planned partition, done manually or via an automatic partitioner, will limit the number of clock domains split across FPGAs so that every clock is supported by the FPGA’s low-skew clock lines.
Figure 4: Clock replication to avoid board clock skew
Data-Path Analysis and Optimization
Even with the right clock distribution, however, the system may run an order of magnitude slower if the rest of the design is not partitioned properly.
Excessive delays often come from purely combinatorial signals traversing a single FPGA as a result of a design partition. These “combinatorial hops” introduce extra delays through board traces and FPGA input and output buffers. They also prevent timing optimizations that synthesis could otherwise perform on complete paths rather than partial segments. The partitioner’s timing engine must recognize these hops and avoid them where possible, though in some designs not all of them can be eliminated.
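To put rough numbers on it (illustrative figures only, not measurements from any specific board or device): if each FPGA I/O buffer contributes on the order of 2 to 3 ns and each board trace another 1 to 2 ns, then one unnecessary hop through an intermediate FPGA adds an input buffer, an output buffer, and two extra board traces, easily 6 to 10 ns on a path that synthesis can no longer optimize as a whole.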
Degradation in performance can also come from pin multiplexing, typically done to overcome the limited pin count per FPGA. The lower a partitioned logic block’s I/O count, the less pin multiplexing is needed and the higher the system performance. This is a well-known challenge for partitioning software. Next-generation solutions use data-path clustering algorithms that consider not only gate count but also the timing criticality of data paths and the potential for combinatorial hops.
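For illustration only (hypothetical module and signal names; production flows generate and synchronize this logic automatically), the sketch below time-multiplexes two 8-bit buses over a single set of eight board pins using a mux clock running at least twice the system clock rate.

```verilog
// Sending FPGA: alternate the two buses onto the shared pins.
module tdm_tx (
    input  wire       mux_clk,
    input  wire       phase,   // 0: send bus_a, 1: send bus_b
    input  wire [7:0] bus_a,
    input  wire [7:0] bus_b,
    output reg  [7:0] pins     // shared FPGA-to-FPGA pins
);
    always @(posedge mux_clk)
        pins <= phase ? bus_b : bus_a;
endmodule

// Receiving FPGA: de-multiplex the shared pins back into two buses.
module tdm_rx (
    input  wire       mux_clk,
    input  wire       phase,
    input  wire [7:0] pins,
    output reg  [7:0] bus_a,
    output reg  [7:0] bus_b
);
    always @(posedge mux_clk)
        if (phase) bus_b <= pins;
        else       bus_a <= pins;
endmodule
```

The cost is visible in the example itself: the shared pins must toggle at least twice per system clock cycle, which is exactly why heavy pin multiplexing caps the achievable system frequency.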
Constraint Generation for Proper Time Budgeting
After the partitioner has done an adequate job of clock and data-path optimization, it should generate a set of timing constraints that passes accurate timing-budget information to synthesis for FPGA-wide optimization. The timing specifications of the whole design have to be adjusted per partition to account for the delays on interconnect wires, the combinatorial hops that cannot be eliminated, and pin multiplexing. This time budgeting is performed efficiently with the help of the partitioner’s built-in timing engine. The new set of timing constraints ensures that each FPGA is optimized for best performance by synthesis and place-and-route.
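As a simplified illustration with assumed numbers: for a 20 MHz system clock (50 ns period) and a register-to-register path that crosses between two FPGAs, roughly 10 ns might be consumed by the output buffer, board trace, and input buffer. That leaves about 40 ns to be budgeted, for example 20 ns for the source FPGA’s clock-to-output path and 20 ns for the destination FPGA’s input-to-register path, and these per-partition budgets are what the generated constraints communicate to each FPGA’s synthesis and place-and-route runs. Pin multiplexing tightens the budget further, since the shared pins must toggle several times per system clock cycle.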
Incremental Flows
As mentioned earlier, incremental flows are critical in ASIC prototyping given the debug-driven nature of the flow. A design fix should trigger re-synthesis of only the affected FPGA and, within that FPGA, re-synthesis of only the changed logic. Verification schedules cannot afford re-synthesis of an entire design spanning four or more FPGAs. Parallel synthesis and place-and-route can mitigate this problem and should be supported, but a full re-synthesis of even a single near-full FPGA is time consuming in itself. In such a case, an incremental flow at both the multi-FPGA and single-FPGA level can reduce iteration time and shorten the schedule.
Furthermore, some incremental synthesis flows require user intervention to predict the location of a potential change. This is not always practical, since it is difficult to anticipate where a bug will appear. More effective is an automatic incremental synthesis that can detect design changes and incrementally synthesize them without user intervention.
Figure 5: Multi-FPGA Synthesis & Partitioning Flow
Conclusion
Experience has shown that complex multi-FPGA prototyping requires careful integration of advanced synthesis and intelligent partitioning, as summarized in Figure 5 above.
This is not to downplay, however, the importance and impact of careful planning. Many of the problems that have plagued prototyping engineers over the years have either been eliminated or simplified by the latest software flows, but an ASIC team can save itself even more grief by considering a partitioning strategy up front and carefully selecting the right hardware target.
by Nang-Ping Chen, Auspy, Inc. and Ehab Mohsen, Mentor Graphics
Nang-Ping Chen, Auspy, Inc., Founder and President: Nang-Ping Chen is the founder and president of Auspy, Inc. Prior to Auspy, he was the founder and CTO of PiE, Inc., an ASIC emulation company eventually acquired by Quickturn and subsequently Cadence Design Systems. Dr. Chen received his Ph.D. from U.C. Berkeley in 1983.
Ehab Mohsen, Mentor Graphics Technical Marketing Engineer: Ehab Mohsen is a Technical Marketing Engineer in the FPGA synthesis division at Mentor Graphics. Prior to Mentor, Mohsen worked at Aptix Corporation as a Technical Marketing Engineer. He holds a BSEE from the University of California, Berkeley.