
Going for Speed

Picking Processors for Performance

It may be surprising that such careful engineering attention is required to gain the desired performance from this lowest echelon of racing – cars powered only by gravity and driven by kids aged 9 to 16. It might also be counter-intuitive that some of the most careful engineering for processor performance comes in the very lowest echelon of computing systems – devices where many tiny processors may be put on a single chip in a system that might sell for single-digit dollars at the retail level. Both, however, are driven by the same constraints. When power, size, and cost are all at a premium, engineering excellence in the extreme is a mandate.

The quest for performance in embedded computing is multi-dimensional. As with the soapbox racer, conflicting forces create a complex tradeoff space, and resources must be carefully balanced to find an optimal solution to the problem. Ironically, desktop processor designers like Intel and AMD have it comparatively easy. Only recently have they needed to resort to architectural solutions beyond those that fall naturally out of process improvement. The embedded market, because of its more constrained environment, reached architectural limits much sooner.

In embedded computing, performance has many enemies. The most prominent of these is power. Many embedded applications are required to run on batteries, to stay within a tight supply-current budget, or to fit a form factor that can’t accommodate extensive cooling provisions. As a result, the desktop processor solution of cranking the clock frequency into the multiple-gigahertz range isn’t practical. Higher clock frequencies mean more power dissipation, and embedded processor architects long ago had to head for greener pastures. Additionally, higher processor frequencies either require much more memory bandwidth or create a system-level imbalance where the processor is faster than the memory.
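
To put a rough number on the frequency penalty, consider the first-order CMOS dynamic-power relation: power equals activity factor times switched capacitance times voltage squared times frequency. The sketch below is illustrative only, with made-up component values, but it shows why the penalty is worse than linear whenever a faster clock also demands a higher supply voltage:

#include <stdio.h>

/* First-order CMOS dynamic power estimate: P = alpha * C * V^2 * f.
 * alpha = activity factor, c_eff = switched capacitance in farads,
 * vdd = supply voltage in volts, freq = clock frequency in hertz.   */
static double dynamic_power(double alpha, double c_eff, double vdd, double freq)
{
    return alpha * c_eff * vdd * vdd * freq;
}

int main(void)
{
    /* Illustrative numbers only: doubling the clock doubles power, and
     * if the higher clock also needs a higher supply voltage, the V^2
     * term makes the penalty considerably worse than 2x.             */
    printf("1 GHz @ 1.0 V: %.3f W\n", dynamic_power(0.2, 1e-9, 1.0, 1e9));
    printf("2 GHz @ 1.2 V: %.3f W\n", dynamic_power(0.2, 1e-9, 1.2, 2e9));
    return 0;
}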

Another alternative is to increase the processor width. If you’ve hit the maximum frequency your system can tolerate, crunching more bits with each clock cycle is one way to increase throughput. More bits of width mean more transistors toggling, however, so you won’t necessarily gain much ground on the power problem. Also, the wider you go, the more you increase the inherent waste in the system architecture. Any operation that requires less than the full width ends up wasting wiggles, and each unneeded flop that flips is more power, area, and cost that you can’t afford in a tightly constrained embedded design. On top of that, the software architecture of a system is tightly coupled to the bit width, so for most applications there is a very real optimal width that doesn’t offer much room for tuning.
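
A toy example makes the waste concrete. The snippet below, which is purely illustrative, toggles the case of ASCII letters first one byte per 32-bit operation and then four bytes packed into each word; the packed version does the same work in a quarter of the ALU operations, but only because this particular job happens to pack cleanly:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Byte-at-a-time: each 32-bit ALU operation does useful work on only
 * 8 of its 32 bits; the rest of the datapath toggles for nothing.    */
static void toggle_case_bytes(char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s[i] ^= 0x20;               /* flip ASCII case, one byte per op */
}

/* Packed: four bytes per 32-bit operation, so the full width is used.
 * (Assumes n is a multiple of 4 and the input is all ASCII letters.) */
static void toggle_case_packed(char *s, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        uint32_t w;
        memcpy(&w, s + i, 4);       /* avoid unaligned access issues   */
        w ^= 0x20202020u;
        memcpy(s + i, &w, 4);
    }
}

int main(void)
{
    char a[] = "WideWord", b[] = "WideWord";
    toggle_case_bytes(a, 8);
    toggle_case_packed(b, 8);
    printf("%s %s\n", a, b);        /* both print "wIDEwORD" */
    return 0;
}

Most real workloads don’t pack that neatly, which is why extra width so often ends up as wasted toggling.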

Once we’ve exhausted the low-hanging fruit of speed and width, we get to pipelining. By overlapping the execution of instructions and looking ahead to operations coming down the road, we can gain a significant efficiency improvement in our system. Instead of being surprised by memory access requests, our processor can be working productively on one task while the pipeline phones ahead for reservations on the next. This, too, has been a desktop computing weapon of choice for many generations now. Here too, however, there are limits to the effectiveness of the technique. The more pipeline we build into our processor, the more logic it takes to do the same task, and the marginal return on that extra logic diminishes as we turn up the pipelining knob. In embedded systems, where logic area is at a premium, pipeline depth is again matched to the rest of the system architecture and doesn’t give the one-dimensional control we’d need for tradeoff-free acceleration.
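
The diminishing return shows up even in the textbook pipeline model, which is a simplification but captures the shape of the tradeoff: n instructions on a k-stage pipeline take roughly (k - 1) fill cycles plus one cycle per instruction plus stall cycles, and the cost of each hazard tends to grow with pipeline depth:

#include <stdio.h>

/* Textbook pipeline model: n instructions on a k-stage pipeline take
 * (k - 1) fill cycles plus one cycle per instruction plus stalls.
 * Assume each hazard costs roughly a pipeline's worth of bubbles, so
 * the penalty per hazard grows with depth.                           */
static double pipeline_speedup(double n, double k, double hazard_rate)
{
    double unpipelined = n * k;
    double pipelined   = (k - 1.0) + n * (1.0 + hazard_rate * (k - 1.0));
    return unpipelined / pipelined;
}

int main(void)
{
    /* Illustrative numbers only. */
    printf("5-stage,  no hazards : %.2fx\n", pipeline_speedup(1e6, 5, 0.0));
    printf("5-stage,  10%% hazards: %.2fx\n", pipeline_speedup(1e6, 5, 0.1));
    printf("12-stage, 10%% hazards: %.2fx\n", pipeline_speedup(1e6, 12, 0.1));
    return 0;
}

With a hazard cost that scales with depth, the 5-stage machine in this model delivers roughly 70 percent of its ideal speedup while the 12-stage machine delivers less than half of its own; the knob keeps costing logic while paying back less.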

With frequency, bit width, pipeline depth, and memory bandwidth all balanced and optimized, where do we go for more speed in our system? We have to break the bounds of monolithic processor myopia and multiply our options. Multi-processors, multi-cores, and multi-threading all let us multiply our performance strategically without building an out-of-balance computing system and thus breaking the sweet spot that we’ve achieved.

The smallest step we can take to multi-fy our system is to use a multithreaded processor or core. Multithreading gives you something akin to a processor and a half: some resources are shared, and other parts of the processor are duplicated for parallel execution. Most embedded applications have a number of tasks that need to be performed simultaneously. In a superscalar architecture, the pipeline frequently stalls waiting for, say, a memory access or the UPS truck or something. While it’s busy waiting, your processor could context-swap to another process thread and make use of the downtime. Multithreading is useful when some parts of your system are inherently out of balance. In many applications, these elements are processor clock frequency and memory bandwidth, which may be constrained by other aspects of the system design. An example of an embedded processor core that uses the multithreading approach is the MIPS 34K architecture, announced early this year.
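
A crude utilization model, and it is nothing more than that, shows why the trick works: if a single thread leaves the execution units idle some fraction of the time while it waits on memory, a second hardware thread can soak up those idle slots until the pipe saturates:

#include <stdio.h>

/* Crude model of hardware multithreading: each thread can keep the
 * execution units busy only (1 - stall_fraction) of the time; extra
 * hardware threads fill the stall slots until the core saturates.   */
static double core_utilization(int threads, double stall_fraction)
{
    double busy = threads * (1.0 - stall_fraction);
    return busy > 1.0 ? 1.0 : busy;
}

int main(void)
{
    /* Illustrative numbers: a thread that waits on memory 40% of the
     * time leaves obvious headroom for a second thread.             */
    printf("1 thread : %.0f%%\n", 100.0 * core_utilization(1, 0.4));
    printf("2 threads: %.0f%%\n", 100.0 * core_utilization(2, 0.4));
    return 0;
}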

If your system tends more toward random multitasking, that is, if the nature of the processes executing in each thread is not well understood at the time the system is designed, a multi-core solution may be more appropriate. Multithreading re-uses some of the processor logic, making for a smaller footprint than two full processors, but random multitasking can sometimes cause resource conflicts between interacting tasks that take away multithreading’s advantages. In these cases, it’s often better to take on the overhead of full multiple cores.

John Goodacre, Multiprocessing Program Manager at ARM, explains: “The inevitability of multithreading is that the software has to understand how to take advantage of the particular multithreading architecture in order to get the most work done. With a multi-core approach, there is more scalability and less requirement for the software to accommodate the scheme.” Compared with multithreading, multiprocessing duplicates more of the processor logic (or all of it) so that separate processes can execute completely independently. “Doubling cores can even more than double performance in some cases due to OS overhead for time slicing plus cache sharing,” Goodacre continues. “With that advantage, you can sometimes use a much smaller processor, saving both area and power.”

Multiprocessing schemes offer considerable advantages when it comes to power management. When multiple processes are executing on multiple processors, each processor can be scaled to just the required performance for its designated task, and processors can even be shut down when their process is not required. With a large, monolithic processor, such optimizations are not possible, and the processor must be sized to match the peak processing demands of the system. When the system is operating off peak, the power penalty of the large processor must still be paid.
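
In code, that policy can be as simple as a few lines in the power-management loop. The sketch below assumes a hypothetical two-core SoC and invented HAL calls (no real vendor API is implied): the control-plane core gets just enough clock for its current load, and the DSP core is power-gated whenever it has no work queued:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical HAL hooks: the names are illustrative, not from any
 * real vendor SDK; stubs here just log what a real driver would do.  */
static void hal_set_core_clock_hz(int core, uint32_t hz)
{
    printf("core %d: clock -> %u Hz\n", core, (unsigned)hz);
}
static void hal_power_gate_core(int core, int off)
{
    printf("core %d: %s\n", core, off ? "power-gated" : "powered");
}

/* Size each core's operating point to its task and gate off cores with
 * no pending work, instead of sizing one big core for the worst-case
 * peak and paying its power bill at all times.                        */
static void balance_power(uint32_t control_load_hz, int dsp_task_pending)
{
    hal_set_core_clock_hz(0, control_load_hz);     /* control plane     */
    hal_power_gate_core(1, !dsp_task_pending);     /* DSP core          */
    if (dsp_task_pending)
        hal_set_core_clock_hz(1, 400000000u);      /* task's fixed rate */
}

int main(void)
{
    balance_power(50000000u, 0);   /* idle: DSP core off                */
    balance_power(50000000u, 1);   /* burst: DSP core back on           */
    return 0;
}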

Once we’re comfortable with the idea of multiple processors being scaled in size for different types of tasks in our embedded system, we can make the next logical leap and customize the processor architectures for particular tasks as well. As soon as we head down this path, of course, we are giving up the advantage we just gained of software-agnostic processing. Developers of an application for heterogeneous multiprocessing must thoroughly understand the architecture of the target computing environment to take advantage of its inherent efficiencies.

One common approach to heterogeneous multiprocessing is to tailor processors to specific tasks by generating optimized processors using a scheme such as Tensilica’s Xtensa. In these architectures, the processor resources are molded to the task, including custom instructions for performance-intensive functions. In this case, the game is flipped. Instead of needing to understand the processor to design the software, knowledge of the software is required when designing the processor. The result, however, is a highly efficient processor that is well matched to a specific task in a multi-processing system.
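
To give a flavor of what that looks like from the software side, here is a hypothetical example; the MY_MAC4 intrinsic and its 4-way behavior are invented for illustration and are not taken from any actual Tensilica tool output. A plain C dot-product loop is shown next to a version restructured around a custom packed multiply-accumulate that a generated core might execute as a single instruction:

#include <stdint.h>
#include <stdio.h>

/* Plain C inner loop: on a small RISC core this is a multiply, an add,
 * and loop overhead per sample.                                       */
static int32_t dot_plain(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* With a configurable core, a 4-way packed MAC could be folded into one
 * custom instruction and exposed as an intrinsic. MY_MAC4() is a
 * hypothetical placeholder for such an intrinsic, modeled in C so the
 * example still runs anywhere.                                         */
static inline int32_t MY_MAC4(int32_t acc, const int16_t *a, const int16_t *b)
{
    for (int k = 0; k < 4; k++)              /* one "instruction" worth */
        acc += (int32_t)a[k] * b[k];
    return acc;
}

static int32_t dot_custom(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i + 4 <= n; i += 4)      /* assumes n % 4 == 0      */
        acc = MY_MAC4(acc, a + i, b + i);
    return acc;
}

int main(void)
{
    int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int16_t b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("%d %d\n", (int)dot_plain(a, b, 8), (int)dot_custom(a, b, 8));
    return 0;                                /* both print 36           */
}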

Taking the heterogeneous idea to the extreme involves generating datapath accelerators directly in hardware for extreme performance hogs such as signal processing and video processing tasks. In these architectures, even custom instructions put too much demand on a host processor for passing data to the custom processing element. In such cases, the accelerator uses shared memory to retrieve and process data directly, pushing the results back into FIFOs or shared memory locations where they can be accessed by other processing elements. Designing these systems is the most challenging task of all, from both a hardware and a software perspective.
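
From the host processor’s point of view, driving such an accelerator usually reduces to setting up buffers and poking a few memory-mapped registers. The register map below is entirely made up for illustration; a real device defines its own layout, and an interrupt would normally replace the polling loop:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical register map for a memory-mapped accelerator; the base
 * address and layout are invented for illustration, not a real device. */
#define ACCEL_BASE   0x40010000u
#define REG_SRC      (*(volatile uint32_t *)(ACCEL_BASE + 0x00))
#define REG_DST      (*(volatile uint32_t *)(ACCEL_BASE + 0x04))
#define REG_LEN      (*(volatile uint32_t *)(ACCEL_BASE + 0x08))
#define REG_CTRL     (*(volatile uint32_t *)(ACCEL_BASE + 0x0C))
#define REG_STATUS   (*(volatile uint32_t *)(ACCEL_BASE + 0x10))
#define CTRL_START   0x1u
#define STATUS_DONE  0x1u

/* Host side of the hand-off: the accelerator pulls its input straight
 * from shared memory and pushes results back into a FIFO or result
 * buffer, so the host only sets up pointers and waits; no per-sample
 * traffic flows through the CPU.                                       */
void run_accelerator(const int16_t *src, int16_t *dst, size_t samples)
{
    REG_SRC  = (uint32_t)(uintptr_t)src;   /* input buffer in shared RAM */
    REG_DST  = (uint32_t)(uintptr_t)dst;   /* result buffer / FIFO base  */
    REG_LEN  = (uint32_t)samples;
    REG_CTRL = CTRL_START;                 /* kick off the datapath      */

    while (!(REG_STATUS & STATUS_DONE))    /* poll; an interrupt would   */
        ;                                  /* be the usual choice        */
}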

Recently, high-level synthesis tools have been developed that assist with this task by compiling C code directly into highly parallelized hardware for algorithm acceleration. These monster datapath elements may include hundreds of arithmetic units that can operate in parallel, giving orders of magnitude acceleration in compute-intensive algorithms. The design challenge, however, is interfacing these blocks to the rest of the embedded system in a way that doesn’t just move the bottleneck. Careful construction of communications protocols and data access is required to take advantage of the full performance and efficiency potential of a hardware accelerator scheme.
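
The kernels these tools accept tend to look like the fixed-bound loop below: static sizes, simple arithmetic, and no dynamic memory, which leaves the synthesizer free to unroll the loop into parallel multipliers and adders. The unroll directive syntax varies by tool, so it appears here only as a comment:

#include <stdint.h>
#include <stdio.h>

#define TAPS 32

/* A fixed-bound FIR-style kernel written the way high-level synthesis
 * tools like it: static sizes, simple loops, no dynamic memory. A tool
 * can unroll the inner loop into TAPS parallel multipliers and adders;
 * the exact pragma/directive is tool-specific.                        */
static void fir32(const int16_t coeff[TAPS], const int16_t sample[TAPS],
                  int32_t *result)
{
    int32_t acc = 0;
    /* e.g. an "unroll" pragma or per-tool directive would go here      */
    for (int i = 0; i < TAPS; i++)
        acc += (int32_t)coeff[i] * sample[i];
    *result = acc;
}

int main(void)
{
    int16_t c[TAPS], x[TAPS];
    int32_t y;
    for (int i = 0; i < TAPS; i++) { c[i] = 1; x[i] = (int16_t)i; }
    fir32(c, x, &y);
    printf("%d\n", (int)y);   /* 0 + 1 + ... + 31 = 496 */
    return 0;
}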

None of these solutions is a panacea for performance and power problems in embedded systems. In each case, the particular demands of the application must be carefully weighed against the strengths of each architectural approach. Just as Toby and his dad want to build a race car that will fit the driver, the track, and the rule book while achieving maximum performance, we need to tailor our embedded system design to the particular performance demands of our application. If we don’t, our competitors will.
