
Changing Waves

Moving from Moore to Multi-core

For over four decades, progress in processing power has ridden the crest of a massive breaker called Moore’s Law. We needed only to position our processing boards along the line of travel at the right time, and our software was continuously accelerated with almost no additional intervention from us. In fact, from a performance perspective, software technology itself has often gone backward – squandering seemingly abundant processing resources in exchange for faster development times and higher levels of programming abstraction.

Today, however, Moore’s Law may finally be washing up. Even if physics sticks with us through a few more process nodes, it’s pretty clear that economics will cut us off first. Thermally, we’ve already hit the wall on high-performance processors, making multi-core strategies more appealing than trying to squeeze another few flops out of a single-threaded processor. It’s almost time to change waves. Before Moore’s breaker crashes on the beach and we get caught inside, we need to cut back and bail – slide down off the wave and paddle back out, dipping through a few mushier swells while looking for the next bluebird breaking offshore that can bring us in even faster.

As embedded designers, why should we care? After all, aren’t we on the back end of the train that’s led by exotic supercomputers, trickles down through networks of pedestrian personal computers, and ends with our often cramped, constrained embedded applications? The answer is “probably not for long.” One might argue that many embedded applications hit the power/performance threshold even before high-performance computers do: the latter benefit from plentiful power supplies and copious cooling, while the former are often restricted by limited battery power, tight enclosures, and miserly BOM budgets. We have to resort to more elegant software and architectural remedies when we can’t just widen the bus, double the gate count, or crank up the toggle rate.

Multi-core is, of course, a hardware architecture solution. While Moore’s Law has been sustained predominantly by continual progress in process technology, there have been significant hardware architecture improvements along the way as well. Those improvements gave us an extra boost of speed now and then, even if we hardly noticed it in the euphoria of process-driven, Moore’s Law mania. With multi-core and other architectural optimizations, however, the hardware guys will start needing some serious help.

Historically, the hardware folks have carried the ball on performance improvement. Process and architecture delivered progress, and software took advantage of it. The playing field is now leveling. It may be time for software folks to stop freeloading and take a turn, grinding out some of the next gigaflop themselves.

The primary practitioners of software-driven performance improvement have always been the compiler folks. You’ve seen them, walking the hallways with far-away looks on their faces, still wearing yesterday’s t-shirts, trying to figure out why the performance trick they spent the past two months implementing squeaked out a scant 0.1% improvement in the thousand-test performance suite (when they were certain it would be good for at least 2% overall, and much more in some cases). They’ve toiled away for years at a huge heap of heuristics, working to find every tiny tweak to the recipe that will make more efficient use of the processor than their best previous effort. Until recently, it’s been a battle of diminishing returns, with each victory harder fought and each reward proportionally smaller than the last.

Now, however, we could be at the dawn of a new heyday in software-based computing acceleration. Sure, hardware architects can throw down a few more MPUs and LUTs that can theoretically be leveraged to parallelize our processes, but we need vastly more sophisticated software tools in order to take full advantage of that extra capability. For reconfigurable, architecture-free hardware, the problem is even more perplexing. Suddenly, the path from algorithm to executable shifts away from the well-worn trail of the monolithic Von Neumann machine toward a vast, uncharted sea of optimization opportunities. Parallelizing compilers and OS technology, software algorithm to hardware architecture synthesis, and custom instruction and processor creation technologies have the potential to gain us as much performance over the next four decades as tens of billions of dollars of process-improvement investment have over the past four.
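As a rough sketch of the kind of help those tools will have to provide, consider the loop below. It is plain C; the OpenMP pragma and the -fopenmp compiler flag are illustrative additions of mine, not anything prescribed here. Today the programmer has to assert, via the annotation, that the loop iterations are independent; a true parallelizing compiler would have to discover that fact on its own from the sequential version.

/* A sequential dot product, and the same loop with an OpenMP
 * annotation asserting that its iterations are independent.
 * Illustrative sketch only: compile with, e.g., gcc -O2 -fopenmp;
 * without -fopenmp the pragma is simply ignored and the code
 * still runs single-threaded. */
#include <stdio.h>

#define N 1000000

double dot_sequential(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];        /* one multiply-add at a time */
    return sum;
}

double dot_parallel(const double *a, const double *b, int n)
{
    double sum = 0.0;
    /* The programmer promises the iterations don't depend on each
     * other; the compiler and runtime split them across cores and
     * combine the per-core partial sums. A parallelizing compiler
     * would have to prove that independence from the loop above. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
    printf("sequential: %f\n", dot_sequential(a, b, N));
    printf("parallel:   %f\n", dot_parallel(a, b, N));
    return 0;
}

The annotated version isn’t free performance, either; the threading overhead only pays off when the loop is long enough, which is exactly the kind of judgment today’s tools still leave to the programmer.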

What we have to recognize, however, is when the time is right to shift our investment priorities. Teams developing optimizing compiler technology receive at least a hundred times less funding than teams working to improve semiconductor processes. The inertial rationale behind this has always been to go back to the well for more of the stuff that has sustained us thus far – to keep riding the wave that has propelled us forward for four decades. At some point, however, and it might be now, the performance improvement per incremental research dollar will be much higher for investment in new compiler and hardware synthesis technology than in semiconductor process refinement.

The new question might not be how to fit more transistors on a die while coaxing each of them to burn less power. Instead, we might need to work on getting ever more efficient use out of the billions of transistors we can already produce. If the power consumption and operating frequency of the human brain are any clue, significant performance improvement is still available under the current laws of physics if we modify our approach to optimization.

The transition will not be trivial, however. It reaches all the way down to the roots of our programming paradigm. Certainly C and C++ are the dominant description languages for algorithms today, but many of us have consequently fooled ourselves into believing that they are the most natural way for humans to express an algorithm. If they were, elementary mathematics textbooks would be filled with C code, and The MathWorks would not be in business. In reality, the C language is a moderately cryptic, arcane (and thin) abstraction layer on top of a Von Neumann processor architecture. Hardware constructs like registers, program counters, memory, and data and address busses are barely buried beneath its scant syntax. If we are to free ourselves from the constraints of the current hardware architecture, we must at least consider abandoning some of the restrictions inherent in the popular dialect as well.
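As a concrete (and admittedly contrived) illustration of that claim, here is an ordinary C loop with the Von Neumann machinery it barely hides called out in comments. The mapping is approximate, but the point is how little translation separates the syntax from the hardware.

/* Summing an array in C: each construct maps almost directly onto
 * Von Neumann hardware. (Illustrative sketch; mappings are approximate.) */
long sum_array(const long *p, int n)
{
    long acc = 0;                  /* an accumulator register             */
    for (int i = 0; i < n; i++) {  /* an index register plus a compare-   */
                                   /* and-branch on the program counter   */
        acc += p[i];               /* an address out on the address bus,  */
                                   /* one word back on the data bus       */
    }
    return acc;                    /* everything ordered one step at a    */
                                   /* time by the program counter, even   */
                                   /* though the additions themselves     */
                                   /* could happen in any order           */
}

Nothing in the mathematics requires that sequential ordering; the notation imposes it, and a parallelizing tool has to work hard to undo it.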

Of course, the most efficient, logical, or descriptive language has never demonstrated much of an advantage in capturing the tongue of the public. English is a perfect example of a sub-optimal syntax surviving well beyond any reasonable rationale. When it comes to performance optimization, however, the semantics of both our languages and our thinking need to be called into question. Do people naturally think of algorithms in small sequential steps? If so, we need compilers that can efficiently extract parallelism from sequential behavioral descriptions, and some evolution of C may be a reasonable choice. If not, we need to derive a better dialect for describing algorithms in parallel. If you look carefully at the semantic shift from procedural C to object-oriented C++, you can see that such a shift is possible. However, if you also look at the realistic adoption rate of truly object-oriented use of C++ (versus using C++ as a more forgiving flavor of procedural C), you see that such a change in mainstream thinking is slow at best.

Embedded computing systems will be forced to shift to more thermally efficient alternative hardware architectures such as multi-core processing and acceleration with reconfigurable fabrics, probably even before normal general-purpose computers (although the HPC community still seems to be blazing the trail for now). As such, we as embedded designers may need to change waves even before our buddies in the “normal” computing world. Our software development tools and paradigms need to be improved dramatically. We should think and plan accordingly. The beach is coming up fast.
