It’s been more than twenty years since I started working on high-level synthesis (HLS). You might say I’ve studied the topic a lot. For most of those two-plus decades, HLS has been widely considered the “design methodology of the future.” And there are those who have held onto the belief that it always will be.
For those of you not familiar with the term, high-level synthesis is the automatic creation of hardware architectures from behavioral descriptions. At first, HLS was known as “behavioral synthesis.” But, after some early bad experiences, the EDA industry quietly shifted the name over to HLS – hoping that nobody would notice or have episodes of PTSD when confronted with the idea.
Why has HLS been so slow to find its way to mainstream adoption? First (and foremost) because HLS is really, really hard. Super hard. Crazy, maddeningly hard. Consider how difficult it is to write software to perform even seemingly simple human tasks – like driving a car or recognizing spoken English. Even for these basic skills that most humans can easily master, creating code that performs them with human-like decision-making can be a major challenge.
Now consider how difficult it must be to write software that can perform a much more complex human-like task, one that only highly educated humans can master, like designing a sophisticated digital hardware architecture. There is no formula or algorithm that spits out an optimized solution. In fact, there is no clear definition of what “optimal” even means in this context. Any attempt to define it usually begins with, “Well, it depends…”
Computer scientists will point out that the core scheduling and allocation problems in HLS are NP-complete, and that any program we write must rely on heuristics to make an educated guess (kinda like we human engineers always do). Nonetheless, over the years, with some very, very bright engineers working tirelessly on the problem, HLS has finally found its way to the forefront.
Today, you can write some behavioral code in a language like C, C++, or SystemC, throw it at an HLS tool, and rapidly get back a detailed hardware design that is almost certainly better than the one you would have spent several months creating yourself in RTL. The productivity factor is amazing. A good hardware engineer armed with HLS can consistently design working hardware at 5x-10x the rate of a non-HLS engineer. And, when the HLS-equipped engineer needs to make an architectural change, the amount of work required is minuscule compared to an RTL revamp.
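To make the idea concrete, here is a rough sketch (not taken from any particular tool or customer design) of the kind of untimed, behavioral C++ an HLS tool consumes – a plain function with loops and arithmetic, from which the tool works out the schedule, pipelining, and resource sharing. The fir4 name and the coefficients are invented for illustration:

```cpp
#include <cstdint>

// Hypothetical untimed behavioral model of a 4-tap FIR filter.
// An HLS tool decides how to schedule, pipeline, and share resources;
// in hand-written RTL, all of that would be the designer's problem.
int32_t fir4(int16_t sample, const int16_t coeff[4]) {
    static int16_t shift_reg[4] = {0, 0, 0, 0};  // becomes registers in hardware

    // Shift in the new sample.
    for (int i = 3; i > 0; --i) {
        shift_reg[i] = shift_reg[i - 1];
    }
    shift_reg[0] = sample;

    // Multiply-accumulate; the tool chooses how many multipliers to instantiate
    // and whether to unroll or pipeline this loop.
    int32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<int32_t>(shift_reg[i]) * coeff[i];
    }
    return acc;
}
```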
Cadence has been in the HLS business for years. The company’s C-to-Silicon Compiler has developed a following and has proven itself very capable for certain classes of designs. About a year ago, however, Cadence decided to double down in the HLS game and acquired Forte Design Systems – along with Forte’s popular Cynthesizer HLS tool.
Cynthesizer was strong in many areas where C-to-Silicon was not. For a while, the choice of Cadence HLS tool depended on the type of design you were doing: C-to-Silicon if your design was more control-dominated, Cynthesizer if you leaned toward datapaths. That made things a bit confusing and difficult for Cadence customers, who often had design components with both sets of characteristics.
Now Cadence has merged the two tools into one new engine called “Stratus High-Level Synthesis Platform.” Stratus, the company claims, has the strengths of both C-to-Silicon and Cynthesizer, making the decision process for customers much simpler and the learning curve much easier to conquer. Cadence says Stratus can be used “across an entire system-on-chip (SoC) design.”
Cadence says HLS is now mainstream. The company claims that 15 of the top 20 semiconductor companies rely on HLS, that HLS projects regularly go into the 10M-30M gate range, and that over 1,000 tapeouts have been completed with HLS. That sounds pretty mainstream to us.
Cadence says that Stratus has been designed to optimize around the complex set of engineering goals faced by today’s chip designers, like rapid ECO turnaround, low power, IP re-use, and routing- and timing-optimized RTL generation. Modern design projects are constantly juggling these factors – as well as the daunting challenge of verification – just about every day of their highly stressful design lives.
One of these in particular, IP re-use, deserves a closer look when it comes to HLS. Today, the vast majority of re-usable IP is distributed in RTL form. That means you can drop the IP block into your design, and it will (you hope) synthesize down into a set of well-behaved gates that play nicely with the rest of your creation. But different designs have different design goals. And most RTL IP blocks were designed with one specific set of tradeoffs in mind.
Additionally, most RTL IP was designed with pretty specific assumptions about the target implementation technology. A change in the silicon process or in the available cell library can be a major roadblock for RTL IP re-use. However, when we distribute IP blocks in a high-level language that can be synthesized with an HLS tool, a vast new world of possibility opens up. Using the same IP block, we can create a wide gamut of architectures, optimized for latency, throughput, power, area, or whatever our key constraints may be. We can also adapt easily to changes in the process or library that affect cell-level timing (something RTL IP cannot readily do). And, if we have source code, changing the behavior of HLS IP is much more straightforward than trying to reverse-engineer complex RTL to make a tweak.
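As a hedged illustration of why high-level IP is so much more malleable than an RTL netlist, imagine a block shipped as parameterized C++ like the sketch below (the MovingAverage class is hypothetical): consumers pick the data width and depth at instantiation, and the HLS tool’s constraints – not the source code – decide whether the generated architecture favors latency, throughput, area, or power.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical high-level IP block: a parameterized moving-average filter.
// The same source can be retargeted without a rewrite: change the template
// parameters at instantiation, or hand the HLS tool different clock, latency,
// or area constraints, and a different architecture comes out the other end.
template <typename T, std::size_t DEPTH>
class MovingAverage {
public:
    T update(T sample) {
        sum_ += sample - buf_[idx_];          // running sum avoids re-adding DEPTH values
        buf_[idx_] = sample;
        idx_ = (idx_ + 1) % DEPTH;
        return static_cast<T>(sum_ / static_cast<int64_t>(DEPTH));
    }
private:
    T buf_[DEPTH] = {};
    int64_t sum_ = 0;
    std::size_t idx_ = 0;
};

// A consumer picks the flavor they need at compile time:
MovingAverage<int16_t, 8>  audio_smoother;   // small, low-area instance
MovingAverage<int32_t, 64> sensor_smoother;  // wider, deeper instance
```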
Clearly HLS IP would be a major improvement in the world of re-use. And, it would bring a major change to the HLS market. Today, HLS is mostly used for designing at the block level. That means even big companies often have only a few HLS users – the gurus who are crafting the functional blocks for the rest of the team. But if we all start consuming HLS IP, we’ll all need HLS tools. It moves HLS from being an IP-producing tool to also being an IP-consuming tool. If you’re in the business of selling HLS tools, that could be very good news indeed.
Cadence says they are focusing on this high-level IP model and working to provide an environment that facilitates the creation, management, and re-use of high-level IP. Since the company has also recently made a big investment in becoming a major IP supplier, the combination seems natural. Stratus itself looks to be a solid contender in the HLS race. It encapsulates many proven, successful Cadence technologies such as the Encounter RTL Compiler. It is designed to work smoothly in the Cadence design, implementation, verification, and IP re-use flows, and it should be a natural evolution for design teams already fluent in Cadence’s tool suite.
It remains to be seen how the high-level IP story will play out in the real, practical world. But, if Cadence can pull it off, it will certainly usher in a new era in HLS adoption – moving the technology from early mainstream to commonplace. It’ll be interesting to watch.
Nice overview, Kevin.
I often wonder if an HLS flow would be able to produce an implementation that has run-time flexibility. Maybe this is a system-level architecture issue: i.e. you compose sets of blocks implemented by HLS and orchestrate them with software to do different things in a given domain?
What would be really great is HLS that could produce a block that is programmable in some way: i.e., the output of HLS would not be a fixed FSM to control the datapath (or to implement the control, in the case of a control-oriented requirement). Does that sound interesting?
It’s been nearly 20 years down this path; it’s nice to see it realized productively, with broad acceptance for large projects.
It wasn’t even 10 years ago, while our team was doing FpgaC, that leading companies like Xilinx were openly rude and hostile to the concept. Today (after dealing with some NIH) the company has done an about-face and is offering fair-to-good HLS tools to its customer base.
FpgaC is pretty primitive by today’s standards, but even with that, and some novel coding styles to infer pipelines, we built some very large, dense, high-speed RC5 challenge FPGA engines that clocked at near the maximum device speed – fast enough that it was impossible to keep the larger Xilinx FPGAs cool. All coded in clean ANSI C99 and runtime-verified on traditional CPU architectures.
Implementing OpenMP and OpenCL takes the process farther down the HLS path, allowing higher degrees of parallelism to be inferred for complex algorithms and control structures. Augment the HLL source with hard trace/profile data to extract the backing data for HLS time/space trade-offs, and you get a cost-reduced, performance-optimized implementation.
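A minimal sketch of the kind of annotation being described here (the function name is made up, and the pragma is standard OpenMP rather than any particular HLS tool’s directive): the annotation asserts that the loop iterations are independent, which a CPU compiler uses to spread work across threads, and which an HLS flow could, in principle, read as license to unroll the loop into parallel hardware.

```cpp
#include <cstddef>

// Hypothetical independent-iteration loop annotated with OpenMP.
// On a CPU the pragma spreads iterations across threads; an HLS flow
// could treat the same assertion of independence as permission to
// unroll into parallel datapath copies.
void scale_and_offset(const float *in, float *out, std::size_t n,
                      float scale, float offset) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale + offset;   // each iteration is independent
    }
}
```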
The insight many miss is that HLL syntax (C/C++/SystemC/OpenCL) provides clear control inferences – if/then/else, switch/break, while/for, and the “?:” operator – that can be applied to data paths as narrow as a single bit (reducing into combinatorial logic) and as wide as the largest native and aggregate data types the language offers. Not that different from VHDL/Verilog, but clearly with a very different inference mindset.
A C algorithm operating on 1-bit-wide variables is cleanly synthesized as combinatorial logic. A C algorithm operating on N-bit-wide data becomes a control structure guiding an N-bit data path between function blocks. All without the built-in bottleneck and serialization of a traditional processor/memory architecture.
Pipelining is easily inferred to minimize the combinatorial depths of the control/data structures.
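A small sketch of those last points in plain C++ (the function names are invented, and the pipelining note is left as a comment because directives are tool-specific): the first function touches only single-bit values and should reduce to a few gates of combinatorial logic, while the second steers a 32-bit datapath that a tool could pipeline across clock stages.

```cpp
#include <cstdint>

// Single-bit logic: an HLS tool would typically flatten this into
// pure combinatorial logic (a handful of gates), not a state machine.
bool majority3(bool a, bool b, bool c) {
    return (a && b) || (a && c) || (b && c);
}

// N-bit logic: the same if/else style now steers a 32-bit datapath.
// A pipelining directive (tool-specific, so shown here only as a comment)
// would let the tool break the multiply/add chain across clock stages.
uint32_t saturating_mac(uint32_t acc, uint16_t a, uint16_t b) {
    uint64_t wide = static_cast<uint64_t>(acc) +
                    static_cast<uint64_t>(a) * b;   // multiply-accumulate
    if (wide > 0xFFFFFFFFull) {                     // saturate instead of wrapping
        return 0xFFFFFFFFu;
    }
    return static_cast<uint32_t>(wide);
}
```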
And, as long as the HLS tool is correct, the code can be verified/simulated on a traditional computer architecture with some stubs to emulate the external logic interfaces.
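A hedged sketch of that host-side verification step, assuming the hypothetical saturating_mac function from the previous sketch is compiled alongside it: the external interface is replaced by a made-up stub, and ordinary C++ test code exercises the same source that will later be synthesized.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// From the synthesizable source file (see the earlier sketch).
uint32_t saturating_mac(uint32_t acc, uint16_t a, uint16_t b);

// Stand-in for whatever bus or FIFO feeds the block in real hardware.
static uint16_t stub_read_sensor(int i) {
    return static_cast<uint16_t>(i * 3 + 1);
}

int main() {
    uint32_t acc = 0;
    for (int i = 0; i < 1000; ++i) {
        acc = saturating_mac(acc, stub_read_sensor(i), 7);
    }
    assert(acc > 0);  // coarse sanity check before committing to synthesis
    std::printf("final accumulator: %u\n", static_cast<unsigned>(acc));
    return 0;
}
```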
Complex logic that isn’t in the critical path can actually infer a small CPU/memory, or thousands of small CPU/memory sets … as a time/space trade-off. Or use hard macros for such.
It’s a new world … we’ve just scratched the surface of complexity. Consider an HLS tool that accepts a compiled binary for an ARM/Intel/PIC/ATmega CPU as the HLL – possibly the flash image for one of those SoC CPUs – and produces a highly optimized hard macro for that application code/device. Augment that “HLL” with hard trace/profile data to extract the backing data for time/space trade-offs, for a cost-reduced, performance-optimized device. All automated and turn-key.