
Multicore to Massively Parallel

“There is a lot of hype about multicore, but there is no infrastructure to support it.” A throw-away remark from a senior figure in an embedded tools company was the start of this article. Certainly he didn’t want to be quoted directly, which made me wonder. As I started researching the issues I talked to a number of people across the industry. The choice wasn’t scientific or comprehensive – just people I thought might have valuable opinions, whose company had recently made an announcement in this area or written an article that turned up in Google, or who I happened to bump into. If I haven’t spoken to you and you have a product that addresses the issues discussed here, please get in touch.

The first issue was that of definition – what do you think of as multicore? The term was hijacked by Intel with the Core Duo announcements. Intel argued that Moore’s law meant that power consumption would continue to rise with processing speed. To get round this, Intel proposed replacing a single, watt-burning high-performance processor with two or more lower-powered processors to share the load. Each core has its own Level 1 (L1) cache, and the Level 2 (L2) cache is shared and used for communicating between processors. One way to use the power is to assign the OS, which has to be multicore aware, to one or more cores and assign applications to other cores. This doesn’t present too many difficulties to the developer, and there are rarely any significant issues in execution. (And, as one cynic put it, with most desktop apps, you get used to occasional crashes.)

While this can produce a significant increase in speed, it doesn’t even begin to approach the full potential of the system. While the cores can run fast, L2 cache is slower, and the bus between the cache and system memory is limited to the speed of the memory. This situation will only get worse as more and more cores are added.

Not all applications are going to produce significant improvements through parallel processing. Amdahl’s law, formulated by computing pioneer Gene Amdahl, predicts that the performance of a parallel processing system is determined, not by the amount of code that can be parallelized, but by the amount that cannot. If ten per cent of your application insists on being linear, then even if the rest of the code is effectively executing in close to zero time, the application can run at most ten times faster than it does on a single processor. However, for many embedded systems, there will be the equivalent of multiple applications running, and the limitations of Amdahl’s law are not going to be as significant as they will be on the desktop.
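
As a rough illustration of that arithmetic, Amdahl’s law is usually written as speedup = 1 / (s + (1 - s) / n), where s is the fraction of the work that must run serially and n is the number of cores. The short Python sketch below simply evaluates that formula; the function name and the core counts are chosen here for illustration and do not come from any particular tool.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Theoretical speedup predicted by Amdahl's law.

    serial_fraction: share of the run time that cannot be parallelized (0..1).
    cores: number of identical processors sharing the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part takes the same time however many cores there are;
    # only the parallel part is divided among them.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Ten per cent serial code, as in the example in the text.
    for n in (1, 2, 4, 8, 80):
        print(f"{n:3d} cores: {amdahl_speedup(0.10, n):.2f}x")
```

With ten per cent serial code, four cores give a little over three times the single-core speed, and even 80 cores stay just under nine times, which is why the serial remainder matters more than the core count.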

Freescale takes a different approach to multicore. The PowerQUICC III starts with four cores, but instead of shared cache memory, a switching fabric links the cores. In this light, PowerQUICC III can be seen as a link between the dual- and quad-core versions of conventional architectures and the devices that attempt to produce powerful compute engines from large numbers of identical cores and massive parallelism, which some are calling the “tiled” approach. Few of the companies building these devices have been very successful. To quote Simon Davidmann, whose new company, Imperas, will soon be making announcements about tools for creating software for multicore devices, “The road to multicore is littered with the bodies of dead multicore companies that didn’t have the software tools in place.”

This view is very much that of David May, CTO of recent start-up XMOS. XMOS will be releasing software design tools in advance of sample silicon, and they argue that they are producing “software defined silicon”. While there are many corpses, some companies still survive, such as PicoChip and Tensilica. And, if last year’s “Hot Chips” conference is an indicator, there will soon be a new round of large-scale tiled offerings, only this time from the established big boys in semiconductors: Intel, for example, was talking about 80 cores in a tiled array.

Many of the tiled products are seen either as the next plateau in desktop machines, cutting in when the current simpler multicore devices peak, or as co-processing and acceleration tools for specialist applications, replacing the boards from companies like Clearspeed.

Devices built with multiple homogeneous cores are only a part of the story. Many systems currently use a mix of different processors, and single-chip platforms with multiple cores of different architectures are in widespread use. These, like TI’s OMAP and NXP’s Nexperia families, are normally targeted at specific applications. Typically they use an ARM core, a DSP, and specialist engines, such as image processors, to provide the basic building blocks for mobile phones, set-top boxes, and other demanding applications. They usually come with their own tool sets (Code Composer Studio for TI), and third parties have targeted them with versions of their own tools. For example, Green Hills offers a wide range of tools for OMAP, from IDE to RTOS.

OK, so we have progressed from dual cores, through many hundreds of identical processors, to platforms of mixed processing elements. What are the issues for the embedded developer? Most of what follows looks at devices with a small number of identical cores – devices that were mostly developed originally for the PC and are slowly finding their way into the embedded space.

There has been considerable coverage, or perhaps hysteria, over the perceived difficulty of programming these devices, and there is a consensus amongst the mainstream software community that programming parallel systems is hard. But there are dissenting voices: Iann Barron, who was a pioneer of parallel computing with the transputer, says that it is not difficult, it is just that programmers have been conditioned to think linearly. This brings up an interesting issue – how far is the programmer conditioned in his thinking by the programming language he uses? David May certainly thinks that this is the case, and he advocates that students should be exposed to as many different languages, and types of languages, as possible. “They are different types of tools and should be used for different things,” he says.

However, the embedded engineer is used to coping with concurrency – in most embedded systems, several things are happening at the same time and interacting with each other at different levels. The embedded developer has not been conditioned to think along straight lines and will find it easier to design applications that use multiple cores than will a mainstream programmer. As David Kleidermacher of Green Hills Software has pointed out, “It is common for embedded developers to employ real-time operating systems, and every RTOS in the world has some form of threading primitive.” Unlike the desktop and enterprise world, the embedded world does not have an enormous inheritance of legacy systems designed to run through a single processor bottleneck. While there have been attempts, most people with experience of parallel applications believe that it is going to be a very long time before there is a parallelizing engine that will suck in existing code and spit out parallel code.

Legacy embedded systems in real-time areas are then likely to include large degrees of multi-threading and thus be more suitable for multicore adoption, and this explains why the early users of multicore are in aerospace and defense. (They also have the budgets to carry out development.)

Even with existing multi-threaded applications, moving from one processor to a multicore device may cause problems. Think about an application where two threads both read and write data to the same address. On a single-core machine they may co-exist happily, but on a dual-core machine, if both threads are running simultaneously, each on its own processor, conflict becomes close to inevitable. Debugging such an event may not be easy, and the tools to do so are not widely available.
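
The two-threads-one-address problem can be sketched in a few lines. The Python below is a minimal illustration, not taken from any product mentioned here: the names SharedCell and run_workers are hypothetical, and the lock stands in for whatever mutual-exclusion primitive an RTOS provides. Standard CPython’s interpreter lock limits how far these threads truly run in parallel, so treat this as a picture of the pattern rather than a multicore benchmark; the same structure applies with a mutex in C or C++.

```python
import threading


class SharedCell:
    """A value that several threads read and write, guarded by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount: int) -> None:
        # The read-modify-write must be one indivisible step. Without the
        # lock, two threads can read the same old value and one update is lost.
        with self._lock:
            current = self.value
            self.value = current + amount


def run_workers(thread_count: int = 2, updates_per_thread: int = 10_000) -> int:
    """Start the workers, wait for them, and return the final shared value."""
    cell = SharedCell()

    def worker() -> None:
        for _ in range(updates_per_thread):
            cell.add(1)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value


if __name__ == "__main__":
    print(run_workers())
```

With the lock in place the final value always equals the number of threads multiplied by the updates per thread. Remove the lock and the result can fall short, and it may do so only occasionally, which is what makes this class of bug hard to reproduce.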

Tools are being developed and deployed to make designing multicore systems from scratch a lot easier. C and C++ are the targets of most efforts in this area, with some work on Java, but the area is handicapped by a lack of agreement over what parallelism should be implemented in the language, and how. And although the area suffers from a lack of standards, it certainly doesn’t suffer from a lack of languages. Over ten years ago, Parallel Programming Using C++, edited by Gregory V. Wilson and Paul Lu, described 15 different approaches to implementing parallel processing in C++, and, a year later, a website – now dead – listed over 30 projects. Since then there have been many more attempts – but much of this work is still aimed at the PC. There are communal attempts to establish standards and tools for multiple processor systems from Eclipse and the Multicore Association (MCA).

Eclipse is developing the Parallel Tools Platform (PTP) to provide developers of all sorts of multi-processor systems with a common IDE. The statement of intent says:

The aim of the parallel tools platform project is to produce an open-source industry-strength platform that provides a highly integrated environment specifically designed for parallel application development. The project will provide:

a standard, portable parallel IDE that supports a wide range of parallel architectures and runtime systems

a scalable parallel debugger

support for the integration of a wide range of parallel tools

an environment that simplifies the end-user interaction with parallel systems.

Version 2.0 of PTP is currently a preview release, but there is concern within the community that a completely new tools paradigm needs to be developed to cope with the larger and more complex systems that are exploiting multi-processors. Much of the work is for next-generation systems, and there doesn’t seem to be a great deal in the project for today’s implementations of small numbers of cores.

The Multicore Association, with a membership of some of the key players in multicore, is working to tackle “the most critical roadblocks for developing multicore-enabled systems”. Active work is going into a high-performance, low-latency communications API (MCAPI), and there are plans for debug extensions into the MCAPI spec to have an implementation-independent view of the state of transactions in progress and an implementation-independent event log. These APIs will be for heterogeneous multicore systems, so they will be useful in designing systems where a mix of different architectural processors is in use, and they will also facilitate changing processors within a design. There are also APIs under development for Resource Management and Task Management.

The debugging work has so far taken a back seat at MCA, with the effort going into the MCAPI. And the MCAPI, although particularly suitable for heterogeneous systems and so of great interest to the embedded world, will not provide much for the current shared-cache multicore systems.

The most recent member of the MCA, at the time of writing, is National Instruments. NI’s LabVIEW embedded development system has always had built-in parallelism, which was exploited by multi-threading OSs or flattened for linear execution. Ian Bell of NI says, “Multiple cores are not a challenge, but concurrency is.” LabVIEW has added the ability to assign threads to processors, either totally or conditionally (depending on the target operating system). An example might be that Cores 1 and 2 run the operating system, Core 3 runs the application, and Core 4 is available to wait for critical threads. It is also possible to extend LabVIEW to instrument the code to some extent, for example by providing breakpoints. At a breakpoint it can capture the state of all the threads and relate them to the functions that are running. While this is only a limited amount of debugging, today’s LabVIEW does support today’s multicore products.

Express Logic has, with the ThreadX RTOS, always been thread aware, and Bill Lamie of Express Logic says that the RTOS maps well to a parallel architecture and the TraceX analysis tool provides a way of viewing threads and their interaction. TraceX, while not specifically designed for multicore operations, could find a good niche there, providing a different kind of elementary debugging.

Debugging appears to be the weakest point for using multicore systems in the embedded space. The issues that already affect multithreading, such as deadlocks, when two threads are each waiting for the other, and race conditions, when two threads are each trying to be the last to update a specific value, are only going to be worse when threads are spread across multiple cores. Many people I spoke to felt that it was debugging, rather than the more high-profile issues of software development, that was going to hold back the take-up of multicore systems.
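
To make the deadlock case concrete: it typically arises when one thread holds lock A and waits for lock B while another holds B and waits for A. One common defence, sketched below in Python, is to take the locks in the same fixed order in every thread. The function name run_transfers, the two-entry table, and the timeout are all hypothetical, picked only to keep the example small; this is one well-known technique, not a description of how any tool discussed in this article works.

```python
import threading
from typing import Dict, Tuple


def run_transfers(rounds: int = 1_000,
                  timeout_s: float = 5.0) -> Tuple[bool, Dict[str, int]]:
    """Run two threads that move units in opposite directions.

    Returns (finished, totals): finished is True if both threads completed
    within the timeout, i.e. no deadlock occurred.
    """
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    totals = {"a": 100, "b": 100}

    def move(src: str, dst: str) -> None:
        for _ in range(rounds):
            # Both threads take lock_a first, then lock_b. Because neither
            # thread ever holds lock_b while waiting for lock_a, the circular
            # wait that defines a deadlock cannot form.
            with lock_a:
                with lock_b:
                    totals[src] -= 1
                    totals[dst] += 1

    threads = [
        threading.Thread(target=move, args=("a", "b"), daemon=True),
        threading.Thread(target=move, args=("b", "a"), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout_s)
    finished = not any(t.is_alive() for t in threads)
    return finished, totals


if __name__ == "__main__":
    print(run_transfers())
```

If the second thread instead took lock_b before lock_a, the program could run correctly many times and then hang, depending on how the threads happened to interleave. That dependence on timing is the debugging difficulty described above.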

The multicore/multiple processor world is like a frontier town in the Old West of the United States: brawling, exciting, and with enormous potential. Some gun-slingers may come out on top for a period. As the use of multiple cores becomes commonplace, so the law, expressed as agreed standards and the tools to use them, will come to the frontier, and things will become quieter. In the meantime, just keep practicing the fast draw.
