The von Neumann architecture is a miracle of efficiency if you measure it by the amount of algorithmic complexity that can be handled by a given number of transistors. If you’ve got enough transistors to create a 32-bit processor plus peripherals plus enough memory to store a decent-sized program, you can execute an enormously complicated algorithm.
Where von Neumann isn’t so efficient is in the amount of computation for a given amount of power, or in the number of computations in a given amount of time. Those battles are won handily by custom, parallel hardware like we might create in an FPGA, or in a custom, algorithm-specific block in an ASIC or custom SoC. Optimized hardware that specifically implements our algorithm will always win in terms of speed and power – at a cost of vastly increased transistor count.
Throw these two abstract realities against the backdrop of Moore’s Law, and you can see what happens. Every couple of years, the cost per transistor drops approximately in half. We can get double the transistor count on the same size piece of silicon, so the size and complexity of the algorithm that can be implemented in parallel hardware doubles as well. A few years ago, if we had a complex operation represented by a bunch of software, we could afford to take only a small, critical function or two out of that operation and implement them in hardware. With each passing process node, however, the number of transistors available for hardware implementation doubles, and so does the amount and complexity of what would have been software but can now be hardware.
Of course, gaining the benefits of moving software into hardware costs something more than just a few orders of magnitude more transistors. It costs design time and effort. Overall, when we put custom hardware implementation on a balance scale, the plus side holds enormous gains in performance and power efficiency. The minus side holds orders of magnitude more transistors (and therefore cost), significantly higher design effort, and less system flexibility.
As we mentioned, Moore’s Law is constantly improving the first item on the negative side. To address the last two items, we have high-level synthesis (HLS) plus FPGA fabric. The myth, and the goal, of HLS is that we can take our software algorithm, run it through our magic high-level synthesis tool, and out pops an optimized, parallelized, super-efficient hardware implementation of that algorithm that we can plop down in an FPGA. That magic C-to-hardware transformation is what HLS has been promising for more than two decades.
If you ask a panel of experts (which I have done on several occasions), you will find opinions ranging from “We can do it today!” to “It will never happen.” Why the range of answers? On the plus side of the scale (our scale is getting a workout today, isn’t it?), there are several tools in production use today that can take untimed algorithms written in carefully constructed C or C++ and turn them almost magically into high-quality synthesizable RTL. We have written about this many times before, of course, and we’ve even written about BDTI’s benchmarking and certification program, where they set about proving it.
Those on the “it will never happen” side of the scale, however, are quick to point out that this is not the mythical beast of software transformed magically into hardware by some omnipotent compiler. These tools require significant hardware expertise on the part of the user. One must understand pipelining, loop unrolling, latency, throughput, fixed-point math, quantization, resource sharing, and other hardware-centric concepts in order to write the code, control the tools, and understand the results.
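To make that concrete, here is a minimal sketch of what “carefully constructed C” tends to look like in practice. The FIR kernel, the type widths, and the names are hypothetical, and the #pragma HLS directives follow one vendor’s (Xilinx Vivado/Vitis HLS) convention; other tools express the same intent through different pragmas or attributes, so treat this as illustrative rather than portable.

```c++
// Hypothetical FIR kernel written for an HLS tool.
// Fixed-point types and #pragma HLS directives are in the Xilinx
// Vivado/Vitis HLS style; other tools use different syntax.

#include <ap_fixed.h>              // vendor-supplied fixed-point types

typedef ap_fixed<18, 6> data_t;    // 18-bit fixed point, 6 integer bits
typedef ap_fixed<18, 6> coef_t;
typedef ap_fixed<32, 12> acc_t;    // wider accumulator to limit quantization error

#define N_TAPS 16

data_t fir(data_t sample, const coef_t coeffs[N_TAPS]) {
    // Pipeline the whole function: accept a new sample every clock cycle
    // (throughput), even though each result takes several cycles (latency).
#pragma HLS PIPELINE II=1

    static data_t shift_reg[N_TAPS];
    acc_t acc = 0;

    // Fully unrolling the tap loop creates N_TAPS parallel multipliers,
    // trading DSP blocks for speed.
    for (int i = N_TAPS - 1; i >= 0; i--) {
#pragma HLS UNROLL
        shift_reg[i] = (i == 0) ? sample : shift_reg[i - 1];
        acc += shift_reg[i] * coeffs[i];
    }
    return (data_t)acc;
}
```

Note how much of this is hardware thinking in C clothing: the fixed-point widths encode a quantization decision, the pipeline directive trades latency for throughput, and the unroll directive trades multipliers for speed.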
The “we can do it today” crowd comes closer to being right with each passing year. Every year, we see new tools on the market, significantly more design experience with the old tools, and improved results reported by those using HLS in production. The tools also seem, subjectively, to be less sensitive to coding style in the original C/C++ – they now support a range of dialects, from custom languages with C-like syntax, to ANSI C/C++, to SystemC.
The “it will never happen” folks also make a compelling point, however. If we expect C-to-FPGA ever to behave like a software compiler, we’re overlooking an important fact about the difference between hardware and software. For a software compiler, there is always something that could be agreed upon as a “best” solution. Compiler developers can tune away, trying to minimize the size and maximize the speed of the generated code. The right answer is reasonably easy to quantify, and the optimization choices made during software compilation have, at best, a modest effect on the results: for the sake of argument, perhaps zero to 20 percent in either direction.
In hardware architecture, however, there is a gigantic range of answers. The fastest solution might take 1000x the amount of hardware to implement as the densest one. The lowest power version might run at a tiny fraction of the maximum speed. The size of the design space one can explore in HLS is enormous. Implementing a simple datapath algorithm in an FPGA, for example, one might choose to use a single hardware multiplier/DSP block for maximum area efficiency – or one might have the datapath use every single available DSP block on the chip – which can now range into the thousands. The cost/performance tradeoff available to the user, then, could be in the range of three orders of magnitude. The “best” answer depends on the user’s knowledge of the true design goals, and how those goals map down to the particular piece of hardware being implemented with HLS. Unless the user has a way to express those design goals and constraints and percolate those down into the detailed levels of the design hierarchy, an HLS tool has almost zero chance of guessing the right answer. It is NOT like a software compiler.
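As a concrete, hypothetical illustration of that design space, consider the sketch below: the same dot-product loop can land at either end of the area/performance range depending on the directives applied. The kernel, the unroll factor, and the pragma syntax (again Xilinx Vivado/Vitis HLS style) are assumptions made purely for illustration.

```c++
// Hypothetical dot-product kernel showing how one C source maps onto very
// different hardware depending on HLS directives (Xilinx-style pragmas shown
// for illustration only).

#define N 1024

int dot_product(const short a[N], const short b[N]) {
    // Partitioning the arrays lets multiple operands be fetched per cycle;
    // without it, extra multipliers would simply starve for data.
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=64
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=64

    int acc = 0;
    for (int i = 0; i < N; i++) {
        // With no unroll directive, the tool can time-share a single
        // multiplier/DSP block across all N iterations: minimal area,
        // roughly N cycles. With the partial unroll below, it builds 64
        // parallel multipliers: roughly 64x the DSP usage for roughly
        // 1/64th the cycles.
#pragma HLS UNROLL factor=64
        acc += a[i] * b[i];
    }
    return acc;
}
```

The particular numbers don’t matter; the point is that nothing in the C source itself says which of those implementations is the “right” one.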
For years, the challenge users threw down to HLS providers was “results must be as good as hand-coded RTL.” This is a worthy goal, and reminiscent of what the hand-assembly crowd expected of the software compilers trying to woo them into high-level languages. However, many HLS tools have now achieved and surpassed that goal. In numerous production reports, HLS tools have delivered results equal or superior to hand-coded RTL – and with a tiny fraction of the design time and effort.
Other, less obvious challenges for HLS have also advanced significantly. Early HLS focused almost completely on datapath and control optimization to match or exceed hand-coded microarchitectures. Interfacing those auto-generated datapaths to the rest of the design, getting data into and out of those datapaths, and creating an automated method of verifying designs done with HLS were all “exercises left to the user.” Today’s tools are much more robust – with rich feature sets for hierarchical design, interface synthesis, verification automation, memory interface management, and much more.
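As one hypothetical example of what interface synthesis buys you, the sketch below binds the arguments of a trivial streaming function to standard bus protocols by directive rather than by hand-written RTL. The function, types, and port names are invented for illustration, and the INTERFACE pragmas once again follow the Xilinx Vivado/Vitis HLS syntax.

```c++
// Hypothetical illustration of interface synthesis (Xilinx-style pragmas):
// data ports become AXI-Stream interfaces, and the scalar plus block-level
// control become an AXI-Lite register bank, handshaking included.

#include <hls_stream.h>
#include <ap_int.h>

void scale(hls::stream<ap_int<16> > &in,
           hls::stream<ap_int<16> > &out,
           ap_int<16> gain) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=gain
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS PIPELINE II=1

    // One sample in, one scaled sample out; the tool generates the stream
    // handshaking and the memory-mapped control registers around it.
    ap_int<16> sample = in.read();
    out.write((ap_int<16>)(sample * gain));
}
```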
The remaining challenge for C-to-FPGA HLS tools is handling the wide range of user expertise. Some HLS users are already happy with the ease of use, but these users are most likely hardware-savvy HDL designers who use HLS as a power tool for creating better RTL more rapidly. Because they are already intimately familiar with both the source code they feed the HLS tool and the RTL they expect it to produce, they are well-qualified pilots who can use HLS to get from point A to point B much more efficiently and effectively.
On the other end of the spectrum, however, are software engineers with little or no hardware expertise, no understanding of HDL, and often massive amounts of legacy code as a starting point. Their goal would be to identify portions of that software suitable for HLS implementation in hardware, and to use HLS to get there efficiently. As of today, those users are still probably going to be disappointed by HLS.
HLS is currently enjoying its highest level of investment in history. More companies are putting more resources into creating and refining HLS tools than ever before. More users are trying and adopting HLS technology, and many already have years of experience using it in a production engineering environment. The marriage of HLS and FPGA is one of the most promising combinations we’ve ever had to loosen the monopoly that von Neumann has on computing and to open us up to a world of vastly increased performance and efficiency.