There are two alternative realities out there in computation: the sequential universe – where our brains naturally conceptualize most algorithms – and the parallel universe – the way actual hardware works. These two realities have always co-existed, but the von Neumann architecture shielded us from the conflict between them. Since our processor architecture was built around a program counter, we could pretend that our machine was doing one instruction at a time.
Therefore, when we wrote our software – even at a much higher level of abstraction than the instructions the machine was executing – the notion that one event would follow another in our algorithm was preserved. Of course, as electronic engineers, we understood that there was a “man behind the curtain” in that scenario. We had entirely parallel hardware pretending to do one thing at a time in order to keep us sane and organized in our algorithm development.
When hardware description languages came along, however, that notion of code always being sequential had to be thrown out the window. The metaphor of instructions being executed one at a time simply did not work for the creation of arbitrary logic networks. As a result, VHDL and Verilog are almost impossibly difficult to understand and use, compared with normal high-level software development languages. And it’s too bad, really. Because if the general programming population could code as efficiently in VHDL and Verilog as they do in C, FPGAs (or something like them) would have probably long ago taken over all serious computing.
The advent of object-oriented programming (OOP) warped the wall between these sequential and parallel domains. The very notion of an object, with its own methods and private data space, was largely incongruous with the idea of a single processor thread executing instructions one at a time. Objects exist and work in parallel, and compiling object-oriented code down to something that simulated that parallelism on a sequential machine was a major challenge in the creation of object-oriented compilers. Still, OOP provided a wormhole through the dimensional barrier between the sequential universe – where we had historically conceived algorithms in procedural programming models – and the parallel universe, where masses of logic gates flipped back and forth almost simultaneously.
Now, however, we face the opposite problem from OOP. Where OOP was trying to map parallel concepts onto sequential hardware, we need to map algorithms that were conceived as sequential onto vastly parallel, optimized hardware structures. If we can do this, we can take the sequential code of legacy software and move it automatically into the domain of custom hardware implementation. As we have discussed many times in these pages, the rewards of this are spectacular – orders-of-magnitude improvements in compute performance, and particularly in performance-per-watt – completely independent of improvements in the underlying semiconductor technology.
When we conceive an algorithm as a set of instructions to be executed in sequence, we humans tend to ignore opportunities for parallelism and pipelining. We have one body and two hands, so we generally do real-world physical tasks in steps. If we were making cookies, for example, we would add the flour, then the sugar, and then the salt. Then we stir. Then we form the result into (pretty boring) blobs, arrange them on a pan, and pop them in a hot oven. If we were thinking a little about opportunities for parallelism, we probably would have started the oven pre-heating before all that. But if we wrote “code” for this process, and our code says “add flour, then sugar, then salt, then stir, then make blobs, then arrange on pan, then bake,” that code is missing information needed to determine, for example, whether the order of those steps can be changed, or whether some of them can be done simultaneously. We know that the flour, sugar, and salt could be added in any order, or in parallel, and that “make blobs” must come after those three. A parallelizing recipe compiler might not know that without additional information.
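To make that concrete, here is a minimal sketch (not any particular tool’s input format) of how that missing ordering information might be stated explicitly, using OpenMP task “depend” clauses. The variables and “steps” are invented purely for illustration:

```c
/* A minimal sketch: depend clauses carry the ordering information that the
 * plain sequential recipe leaves implicit.  Build with: gcc -fopenmp recipe.c */
#include <stdio.h>

int main(void)
{
    int flour = 0, sugar = 0, salt = 0, dough = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* The three additions are independent: any order, or all at once. */
        #pragma omp task depend(out: flour)
        flour = 1;                              /* add flour */

        #pragma omp task depend(out: sugar)
        sugar = 1;                              /* add sugar */

        #pragma omp task depend(out: salt)
        salt = 1;                               /* add salt  */

        /* Stirring must wait for all three ingredients. */
        #pragma omp task depend(in: flour, sugar, salt) depend(out: dough)
        dough = flour + sugar + salt;           /* stir      */
    }   /* implicit barrier: all tasks are complete here */

    printf("dough ready: %d\n", dough == 3);
    return 0;
}
```

Strip out the depend clauses and you are back to the original recipe: correct, but silent about what could have happened in parallel.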
This is the first challenge of making tools that can automatically convert sequential algorithms into optimized parallel architectures: the tools need to be able to figure out dependencies in the data and in the order of operations in order to create a parallel model. There is also the issue of resources, and the utilization of those resources. If we had three cooks, one could be adding flour while a second adds sugar and a third adds salt. The number and type of available resources determine what we can and cannot parallelize.
In the world of hardware, resources are not infinite. If we have a multiplication operation inside a loop, and that loop is executed one million times, it would help to have a number of hardware multipliers on our chip working on those operations in parallel. But we probably don’t have a million multipliers. Our tool needs to understand the available resources when making decisions about parallelism. Furthermore, those resources could be arranged in pipelines in order to maximize their utilization and increase throughput. If three cooks are constantly mixing bowls of dough while three more are making dough balls, and others are constantly carrying pans to the optimal number of ovens, we can see how our boring cookie production could be optimized for maximum throughput.
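Today, that resource information is usually communicated by hand, as directives alongside the loop. As a rough sketch: the pragma spellings below follow the Xilinx Vivado/Vitis HLS convention (other tools spell them differently), and the kernel itself is invented for illustration:

```c
/* A hypothetical HLS-style kernel: one million multiplies, but only as many
 * hardware multipliers as the unroll factor requests. */
#define N 1000000

void scale(const int a[N], const int b[N], int out[N])
{
    for (int i = 0; i < N; i++) {
        /* Ask for 8 multipliers working side by side ...           */
        #pragma HLS UNROLL factor=8
        /* ... and pipeline the loop so they stay busy every cycle. */
        #pragma HLS PIPELINE II=1
        out[i] = a[i] * b[i];
    }
}
```

Change the unroll factor and you trade silicon area for throughput – a decision the original sequential source code never had to express.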
Of course, in addition to throughput, there is also the consideration of latency. Our goal might be to maximize the number of cookies coming off the end of the line, or it might be to minimize the time from startup until the first batch of cookies is ready to eat. Those would require completely different architectures, and they would require our tool to understand our latency vs throughput goals, which may not have been captured in our “code.”
The point of all this is that the notion of simply compiling legacy (sequential) software into optimized parallel hardware probably requires information that is not normally present in the original source code itself. It turns out that much of this information can be inferred by clever tools, but some of it cannot. And the space of potential implementations is exponentially larger than for a straight-ahead von Neumann implementation.
And this discussion only skims the surface of the problem. Most of what we’ve talked about here is a dataflow situation, but many algorithms are much heavier on conditional behavior and control. These types of algorithms can also be parallelized, but the approach to optimizing them is very different.
This wall between the parallel and sequential worlds is incredibly difficult to breach. Humans conceptualize sequentially, and in doing so we omit critical information from our code that is required for optimal parallelism. Automated tools have to guess at the humans’ intent, and they often don’t have the information they need to make those guesses intelligently. But requiring the humans to do the actual parallelizing isn’t the answer. Our brains simply don’t work that way, and human-created architectures will always fall short on the optimization front.
Today, optimizing hardware implementations of sequential software algorithms is the domain of human intelligence and high-level synthesis tools. The tool technology has come a long way in the past three decades, but it still has far to go before we conquer the challenge of automatically targeting large, legacy software applications to reconfigurable heterogeneous computing machines that include both von Neumann and alternative hardware architectures such as FPGA fabric. The creation of such a tool is an incredibly complex proposition, but one that has the potential to revolutionize computing itself. Let’s hope our peers who are tackling this challenge have the wind at their backs!
Hmm. Having been involved with parallel processing since the late 1980s, I don’t fully agree, even if I understand why the author thinks this way. Humans do not conceptualise sequentially, but have been “brainwashed” to apply sequential recipes at school. And yes, that is partly because of the von Neumann architecture (I call that the von Neumann syndrome).
Software is modelling, and the real world is concurrent/parallel by nature. Hence, thinking in terms of parallelism is quite natural. We have the formalisms in the form of process algebras, with Hoare’s CSP being one of the pioneers; it made it into a real processor (the transputer) and into programming languages and kernels like occam, the Virtuoso RTOS, …
And of course, reverse engineering sequential code is hard precisely because the parallel information that is inherent in the problem domain has been more or less thrown away and replaced by a complex state machine.
I had the personal experience years ago when, after 10 years of sequential programming (Algol, Pascal, Fortran), I attended a programming course in occam. At some point, my brain made the click, undoing ten years of “brainwashing,” and since then thinking in parallelism is much more natural than thinking in state machines that no brain can ever keep in its head. Just imagine you have to write a program that models the conversation of people in a group. Is that easy with a state machine? Or does one timeslice every millisecond? Model each person as an event-driven process and the communication as messages, and things become clear. Divide and conquer.
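As a rough illustration (in C with pthreads rather than occam, with invented names), the divide-and-conquer version looks like this: each person is a process reacting to message events, and no global state machine appears anywhere.

```c
/* A minimal sketch: two "people" take turns speaking by exchanging messages
 * through a small mailbox.  Build with: gcc -pthread conversation.c */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             has_msg;    /* 1 when a message is waiting        */
    int             turn;       /* whose turn it is to speak: 0 or 1  */
    char            text[64];
} mailbox_t;

static mailbox_t box = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, ""
};

static void *person(void *arg)
{
    int id = *(int *)arg;
    for (int i = 0; i < 3; i++) {
        pthread_mutex_lock(&box.lock);
        while (box.turn != id)                   /* wait for the event */
            pthread_cond_wait(&box.ready, &box.lock);
        if (box.has_msg)
            printf("person %d heard: %s\n", id, box.text);
        snprintf(box.text, sizeof box.text, "hello %d from person %d", i, id);
        box.has_msg = 1;
        box.turn = 1 - id;                       /* hand the conversation over */
        pthread_cond_signal(&box.ready);
        pthread_mutex_unlock(&box.lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int id[2] = {0, 1};
    pthread_create(&t[0], NULL, person, &id[0]);
    pthread_create(&t[1], NULL, person, &id[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return 0;
}
```

Add a third or fourth person and the per-process logic barely changes; try the same thing as one flat state machine and the state space explodes.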
I also noticed often that hardware people had a lot fewer issues with thinking in parallel. They are used to “connecting” blocks without having to know the state machine inside. It’s sufficient to know the interface, showing that it really is a matter of education and training.
“Design parallel, optimise sequentially,” and not only does one get much cleaner designs, but also better ones. Parallelism is not voodoo science; it is natural.
@ericverhulst – I have to mostly agree. It’s actually simpler to explain in real-life terms, from a systems-level view, in terms of successive decomposition.
Junior/unskilled engineers solve problems with a monolith of sequential logic/code that, for all practical purposes, can only be made to work in real life for the smallest of problems.
Senior/Skilled engineers solve problems incrementally with smaller logic/code bodies that work together. And as the problem complexity increases, these small logic/code bodies become additional state machines, threads, tasks, processes, and pipelines in the finished architecture of more complex problem solutions.
This is not new … this has been the state of the art for more than 40 years. We have been building single processor systems with interrupts and partitioned tasks for 60 years. We have been building SMP systems for more than 40 years. We have been building NUMA machines for more than 30 years. We have been building large clusters for more than 25 years.
I introduced general lockf semaphores into the UNIX/POSIX architecture in 1982 (35 years ago) as a general solution for the deadlock and synchronization problems of cooperating processes. And we added both threads and lightweight threads to the UNIX/POSIX architecture shortly after. And with that, we created compilers and libraries for re-entrant code bodies, to support shared data/code address spaces for threads, events, and call-backs.
Tasks, interrupts, events, call-backs, processes, threads, messages, mailboxes, semaphores, locking, and client/server are all OLD, well-understood problems. This isn’t magic; this IS the required toolbox of any trained software or computer engineer.
Communication and sequencing are a critical part of designing all systems … systems of people in the workplace, and systems of control and data in automated designs. Each requires specific training, plus experience, to be effective.
As you note, it’s a mistake not to introduce these naturally parallel concepts from the beginning.
Nobody expects building designers to start with a mountain and carve out rooms and hallways … we start with stones/bricks/sticks, and build the walls for rooms and hallways — as complex as necessary, within the limits of the materials.
We need to teach our hardware/software engineers these same skills from the beginning – KISS for naturally parallel designs.
@kevin – For skilled engineers using OpenMP, OpenCL, CUDA and/or OpenMPI, most Single-Instruction-Multiple-Data (SIMD), Multiple-Instruction-Single-Data (MISD), and Multiple-Instruction-Multiple-Data (MIMD) parallel instruction/data flow problems have been very tractable for nearly a decade now in systems with SMP, SMP+GPU, and SMP+GPU+Clusters.
Intel actively supports OpenMP, OpenCL, and OpenMPI in their SMP, SMP+GPU, and SMP+GPU+Cluster based systems.
Altera (now part of Intel) also actively supports OpenCL for FPGAs.
So for Intel-based systems, SMP+GPU+FPGA+Clusters pretty much covers SIMD, MISD, and MIMD problems well.
Apple has supported SMP+GPU+Clusters with OpenCL and OpenMPI for nearly a decade now.
Even Xilinx finally understood they could not avoid OpenCL, and they have been on board for a couple of years now.
OpenCL on FPGAs is only nasty because the back-end place-and-route (PR) tools are not programmer-friendly.
@TotallyLost – agreed, but I think it misses my point a bit.
You say “For skilled engineers using OpenMP, OpenCL, CUDA and/or OpenMPI,”… I agree completely there.
But, I’d argue that is a very small percentage of software engineers. And, none of those address the issue of legacy applications. Rewriting all the world’s legacy software to take advantage of heterogeneous computer architectures is a pretty daunting task. Unfortunately, coming up with tools that could automatically make legacy applications work efficiently on heterogeneous architectures is also a pretty daunting task. There does not appear to be a magic bullet here.
@Kevin – You say “Rewriting all the world’s legacy software to take advantage of heterogeneous computer architectures is a pretty daunting task.”
The reality is that very little sequential code will benefit the user significantly if ported to some parallel platform, and in particular to some mixed-architecture system with a mix of ISAs, GPUs, and FPGAs. Better than 99% of legacy code, or even new code, is low-resource, highly sequential code, with run times that are insignificant on any modern platform. This is the code that most programmers write and support … it is highly unlikely to benefit from being optimized for some mixed-architecture system.
The first step in performance optimization is nearly always optimizing memory accesses to be cache efficient — improving locality of access in the algorithm, then packing density to optimize references per cache line fill. With CPU to memory latency ratios typically higher than 10:1, this is the critical “low hanging fruit” for optimization. Without doing this first, most SMP parallel optimizations will remain serially blocked on memory resources (primary sequential bottleneck subject to Amdahl’s law). This skill we teach to sequential programmers — before parallel skills.
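As a minimal illustration of that low-hanging fruit (sizes and names are invented): the two loops below compute the same column sums, but the second walks memory in the row-major order that C actually stores it in, so each cache line fill is fully used before moving on.

```c
/* Same result, very different cache behaviour (C stores matrices row-major). */
#define ROWS 2048
#define COLS 2048

static double a[ROWS][COLS];
static double colsum[COLS];

/* Stride-COLS access: nearly every load touches a different cache line. */
void colsum_columnwise(void)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            colsum[j] += a[i][j];
}

/* Unit-stride access: each cache line is consumed completely before the next. */
void colsum_rowwise(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            colsum[j] += a[i][j];
}
```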
The few programs that are still heavily resource-driven, with long run times, can generally be parallelized effectively with some very minor rewriting/partitioning with OpenMP, and/or easily pushed into a mixed-architecture system with parallel optimizations, by a skilled parallel professional engineer.
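That “very minor rewrite” is often literally one pragma on an already cache-friendly loop. A hedged sketch, with an invented kernel:

```c
/* The loop body is unchanged; one pragma spreads the iterations across cores.
 * Build with: gcc -O2 -fopenmp saxpy.c */
#include <stddef.h>

void saxpy(size_t n, float alpha, const float *x, float *y)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Even that one-liner assumes someone has verified there are no loop-carried dependencies or shared-state races, which is exactly where the training and experience come in.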
Letting a generic sequential software engineer who lacks both training and experience in parallel programming do the job is a mistake, unless it is specifically tasked as a training exercise without market lead-time concerns.
Hardware engineers have similar tasking boundaries by skill set and experience … we let some engineers design at the board level … we let skilled VLSI engineers design at the chip level. Only an idiot manager would let board level designers take a critical revenue and market lead time project to VLSI … a good hardware manager would hire a seasoned VLSI chip level designer.
A good software manager will hire a seasoned parallel programmer if the project is critical revenue and time to market.
If either a hardware or software engineering manager is not concerned about critical revenue and time to market, it’s sometimes the best choice to train in-house team members, assuming that is the best way to allocate staff budgeting.
Training software engineers in good OpenMP skills in-house, for deployment on modern multi-core SMP machines, is exceptionally low risk and low budget. It’s still a good idea to press staff to take a continuing-education class in parallel programming that includes threads, OpenMP, OpenMPI, and OpenCL, and then reward that professional commitment with cherry-picked in-house parallel projects.
Likewise on the hardware side, we generally migrate board level engineers slowly into large FPGA projects, and then into VLSI projects after they have taken some graduate level VLSI classes as continuing education.
It’s just as unlikely that significant amounts of legacy code will need to be parallelized as it is that significant amounts of legacy board-level designs will need to be pushed into VLSI.