
Changing Waves

Moving from Moore to Multi-core

For over four decades, progress in processing power has ridden the crest of a massive breaker called Moore’s Law. We needed only to position our processing boards along the line of travel at the right time, and our software was continuously accelerated with almost no additional intervention from us. In fact, from a performance perspective, software technology itself has often gone backward – squandering seemingly abundant processing resources in exchange for faster development times and higher levels of programming abstraction.

Today, however, Moore’s Law may finally be washing up. Even though physics may stick with us through a few more process nodes, it’s pretty clear that economics will probably turn us off first. From a thermal perspective, we’ve already hit the wall with heat on high-performance processors, making multi-core strategies more appealing than trying to squeeze another few flops out of a single-thread processor. It’s almost time to change waves. Before Moore’s breaker crashes on the beach and we get caught inside, we need to cut back and bail – slide down off the wave and paddle back out, dipping through a few mushier swells looking for the next bluebird breaking offshore that can bring us in even faster.

As embedded designers, why should we care? After all, aren’t we on the back end of the train that’s led by exotic supercomputers, trickles down through networks of pedestrian personal computers, and ends with our often cramped, constrained embedded applications? The answer is “probably not for long.” One might argue that many embedded applications hit the power/performance threshold even before high-performance computers do, the latter benefiting from plentiful power supplies and copious cooling while the former are often restricted by limited battery power, tight enclosures, and miserly BOM budgets. We have to resort to more elegant software and architectural remedies when we can’t just widen the bus, double the gate count, or crank up the toggle rate.

Multi-core is, of course, a hardware architecture solution. While Moore’s Law has been obeyed predominantly by continual progress in process technology, there have been significant hardware architecture improvements along the way as well. Those improvements gave us an extra boost of speed now and then, even if we hardly noticed it in the euphoria of process-driven, Moore’s Law mania. With multi-core and other architectural optimizations, however, the hardware guys will start needing some serious help.

Historically, the hardware folks have carried the ball on performance improvement. Process and architecture delivered progress, and software took advantage of it. The playing field is now leveling. It may be time for software folks to stop freeloading and take a turn, grinding out some of the next gigaflop themselves.

The primary practitioners of software-driven performance improvement have always been the compiler folks. You’ve seen them, walking the hallways with far-away looks on their faces, still wearing yesterday’s t-shirt and trying to figure out why the performance trick they spent the past two months implementing squeaked out a scant 0.1% improvement in the thousand-test performance suite (when they were certain it would be good for at least 2% overall, and much more in some cases). They’ve toiled away for years at a huge heap of heuristics, working to find every tiny tweak to the recipe that will make more efficient use of the processor than their best previous effort. Until recently, it’s been a battle of diminishing returns, with each victory harder fought and each reward proportionally smaller than the last.

Now, however, we could be at the dawn of a new heyday in software-based computing acceleration. Sure, hardware architects can throw down a few more MPUs and LUTs that can theoretically be leveraged to parallelize our processes, but we need new and vastly more sophisticated software tools in order to take full advantage of that extra capability. For reconfigurable, architecture-free hardware, the problem is even more perplexing. Suddenly, the path from algorithm to executable is shifted away from the well-worn trail of the monolithic von Neumann machine toward a vast uncharted sea of optimization opportunities. Parallelizing compilers and OS technology, software algorithm to hardware architecture synthesis, and custom instruction and processor creation technologies have the potential to gain us as much performance over the next four decades as tens of billions of dollars of process improvement investment have over the past four.

What we have to recognize, however, is when the time is right to shift our investment priorities. Teams developing optimizing compiler technology are funded at a level at least a factor of 100 less than teams engaged in improving semiconductor processes. The inertial rationale behind this has always been to go back to the well for more of the stuff that has sustained us thus far – to keep riding the wave that has propelled us forward for four decades. At some point, however, and it might be now, the performance improvement per incremental research dollar will be much higher for investment in new compiler and hardware synthesis technology than in semiconductor process refinement.

The new question might not be how to fit more transistors on a die while coaxing each of them to burn less power. Instead, we might need to be working to get ever more efficient use out of the billions of transistors we can already produce. If the power consumption and operating frequency of the human brain are any clue, significant performance improvement is still available under the current laws of physics if we modify our approach to optimization.

The transition will not be trivial, however. It goes as far as the roots of our programming paradigm. Certainly C and C++ are the dominant description languages for algorithms today, but many of us have consequently fooled ourselves into believing that these are the most natural way for humans to express an algorithm. If they were, elementary mathematics textbooks would be filled with C code, and The MathWorks would not be in business. In reality, the C language is a moderately cryptic, arcane (and thin) abstraction layer on top of a von Neumann processor architecture. Hardware constructs like registers, program pointers, memory, and data and address buses are barely buried beneath its scant syntax. If we are to free ourselves from the constraints of the current hardware architecture, we must at least consider abandoning some of the restrictions inherent in the popular dialect as well.
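To see how close to the surface the hardware sits, consider a minimal sketch of an ordinary C summation. The function name `sum_by_pointer` is hypothetical and chosen only for illustration; the comments map each line onto the machine construct it thinly disguises.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from any library: sums n ints the way C
   invites us to think about it, one memory address at a time. */
long sum_by_pointer(const int *p, size_t n)
{
    assert(p != NULL || n == 0);  /* an address is required if there is data */

    long acc = 0;                 /* stands in for an accumulator register */
    const int *end = p + n;       /* address arithmetic, right in the source */

    while (p != end) {            /* compare two addresses, branch */
        acc += *p;                /* load from memory, add to the accumulator */
        ++p;                      /* step the address to the next word */
    }
    return acc;
}
```

Nothing in the source says "add these numbers up." It says "walk this address forward and accumulate," which is a description of one particular machine, not of the algorithm.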

Of course, the most efficient, logical or descriptive language has never demonstrated much of an advantage in capturing the tongue of the public. English is a perfect example of a sub-optimal syntax surviving well beyond any reasonable rationale. In the case of performance optimization, however, the semantics of our language and our thinking need to be called into question. Do people naturally think of algorithms in small sequential steps? If so, we need compilers that can efficiently extract parallelism from sequential behavioral descriptions, and some evolution of C may be a reasonable choice. If not, we need to derive a better dialect for describing algorithms in parallel. If you look carefully at the semantic shift from procedural C to object-oriented C++, you can see that the shift is possible. However, if you also look at the realistic adoption rate of true object-oriented use of C++ (versus using C++ as a more forgiving flavor of procedural C), you see that such a change in mainstream thinking is slow at best.
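The gap between the two styles can be sketched in plain C. The example below is illustrative only: `sum_sequential` and `sum_chunked` are hypothetical names, and the chunked version still runs on one core. It simply restructures the work so that the partial sums share no state, which is the property a parallelizing compiler or a multi-core runtime would need to find (or be told about) before spreading the chunks across cores. Integer data is assumed so that regrouping the additions cannot change the result.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential description: every iteration needs the previous value
   of acc, so the loop as written is one long dependency chain. */
long sum_sequential(const int *data, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}

/* Chunked description: each partial sum touches only its own slice,
   so the outer iterations are independent of one another.
   Assumes n * nchunks does not overflow size_t. */
long sum_chunked(const int *data, size_t n, size_t nchunks)
{
    assert(nchunks > 0);

    long total = 0;
    for (size_t c = 0; c < nchunks; ++c) {
        size_t begin = c * n / nchunks;        /* first index of slice c */
        size_t end = (c + 1) * n / nchunks;    /* one past its last index */

        long partial = 0;                      /* private to this chunk */
        for (size_t i = begin; i < end; ++i)
            partial += data[i];

        total += partial;                      /* the only shared step */
    }
    return total;
}
```

Both functions compute the same answer, but only the second makes the independence visible in its structure. With floating-point data the regrouping could change rounding, which is one reason a compiler may not make this transformation on its own.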

Embedded computing systems will be forced to shift to more thermally efficient alternative hardware architectures such as multi-core processing and acceleration with reconfigurable fabrics, probably even before normal general-purpose computers (although the HPC community still seems to be blazing the trail for now). As such, we as embedded designers may need to change waves even before our buddies in the “normal” computing world. Our software development tools and paradigms need to be improved dramatically. We should think and plan accordingly. The beach is coming up fast.
