Excuse me while I juggle these flaming chainsaws. While riding a unicycle on a tightrope crossing over Niagara Falls. Blindfolded. Challenging enough for ya?
That’s essentially what a new company called Soft Machines is attempting. It’s a new firm with an entirely new microprocessor design that is taking on the two toughest challenges in the business: how to increase performance while reducing power, and how to run programs written for other processors. Oh, and they’re competing with ARM for embedded RISC processor cores. And then they’ll be taking on Intel and AMD with x86 processors. Challenging enough for ya?
It’s not every day you get to see a brand new microprocessor company. What do you think this is – 1998? Yet Soft Machines thinks it’s cracked the secret code to making embedded processors that are both fast and small, quick yet power-efficient, new yet totally compatible with existing binary code.
It calls its new processor VISC (which doesn’t stand for anything), and it’s set to begin licensing the design next year. Like ARM, MIPS, and other IP vendors, Soft Machines will license VISC to SoC designers looking to embed the CPU core(s) into their own chip designs. What Soft Machines promises is better absolute performance than the other guys on single-threaded code, while also delivering better performance-per-watt.
Perhaps most surprising of all is that VISC runs ARM software, even though Soft Machines isn’t an ARM licensee. Other companies have created clean-room “ARM clones” before, but they’ve all failed due to either technical shortcomings or legal entanglements. ARM’s architecture is encumbered by patents, as are most microprocessors, so it’s legally tricky to duplicate an ARM-compatible CPU without a license. Undeterred, Soft Machines thinks it’s found a way around both the legal and the technical roadblocks. And if the company’s recent technology demonstration is any indication, they’re correct. A hardware prototype board did indeed seem to run unmodified ARM code. Quite well, in fact.
And that’s just the start of what makes VISC so unusual. The company has rethought almost everything about internal CPU design orthodoxy. Some of these techniques we’ve seen before in other (mostly unsuccessful) CPU creations. Others appear to be unique. Next week we’ll delve into VISC in detail. For now, let’s look at the challenges VISC faces, starting with its ARM compatibility.
In a nutshell, VISC does hardware binary translation in real time. That is, it executes unmodified software compiled for an ARM Cortex A-series processor, fetching the binary instructions and then converting them to its own internal instruction set. It does this on the fly, in hardware. The overall concept isn’t new, but it is difficult and fraught with problems. Perhaps the best examples of this approach are Intel’s and AMD’s own x86 processors. For many years, those chips have converted x86 instructions into a proprietary and undocumented set of RISC-like pseudo-instructions before dispatching them to the chips’ internal execution resources. The conversion is invisible to programmers. Without a detailed circuit diagram, you’d never know the chip wasn’t actually running x86 code.
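To make the idea concrete, here is a toy sketch of instruction translation in Python. It only illustrates the general concept of expanding a "guest" instruction into simpler internal operations. The tuple format, the `add_mem` instruction, and the micro-op names (`uadd`, `uload`) are all invented for this example and say nothing about how ARM, x86, or VISC actually encode anything.

```python
# Toy illustration only: the guest instruction format and micro-op names
# are hypothetical and do not reflect ARM, x86, or VISC internals.

def translate(guest_instr):
    """Expand one guest instruction (a tuple) into a list of internal micro-ops."""
    op = guest_instr[0]
    if op == "add":
        # ("add", dst, src1, src2) maps one-to-one onto an internal add.
        _, dst, a, b = guest_instr
        return [("uadd", dst, a, b)]
    if op == "add_mem":
        # ("add_mem", dst, src, addr_reg) splits into a load followed by an add.
        _, dst, a, addr = guest_instr
        return [("uload", "tmp0", addr), ("uadd", dst, a, "tmp0")]
    raise ValueError(f"unsupported guest instruction: {op}")


def translate_program(program):
    """Translate a whole guest program, preserving instruction order."""
    micro_ops = []
    for instr in program:
        micro_ops.extend(translate(instr))
    return micro_ops


program = [("add", "r0", "r1", "r2"), ("add_mem", "r3", "r0", "r4")]
micro_ops = translate_program(program)
```

The `ValueError` branch is where the last 10% lives: a real translator has to handle every instruction, flag, and corner case exactly, not just the easy ones.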
VISC does something quite similar, but with ARM code. At least, for now. Soft Machines says an x86 version is in the works, and a native Java version is a possibility. Other conventional CPU architectures (MIPS, PowerPC, etc.) might follow if demand dictates. In other words, if enough people are willing to pay for it, Soft Machines thinks it can do the necessary engineering.
Binary translation has a long and checkered history. Everybody likes the idea of executing code for Processor X on Chip Y. If we could all move code around regardless of CPU, the world would be a better place. That’s what virtual machines (e.g., Java, .NET) were supposed to deliver, but famously don’t.
The problems are akin to those of simultaneous translation of human languages at an international diplomatic meeting. With human languages, the trick is to capture the nuances of meaning: the idioms, clichés, and expressions that don’t come across literally. The classic (and possibly apocryphal) example is the automated English-to-Russian translation system that rendered the English expression “the spirit is willing but the flesh is weak” as the Russian equivalent of “the vodka is good but the meat is rotten.” Translation is harder than it sounds.
Translating computer languages is harder still, because there’s no room for interpretation (so to speak). Computers follow instructions literally. Translating computer binaries is not like lobbing hand grenades. It’s not okay to just get close. You have to be exactly on target, all the time, or the program misbehaves. The academic world is littered with the corpses of graduate students who embarked on binary-translation projects, buoyed by early success, only to get bogged down in the last 10%. The devil dwells in the details.
Accumulated disappointment and failure notwithstanding, binary translation does occasionally work. But the very few exceptions serve only to prove the rule. Apple successfully translated most 68K binaries to PowerPC when it switched processors in 1994, and then it pulled the rabbit out of the hat again when it switched to Intel processors. DEC could convert x86 binaries to Alpha code on the fly, and IBM was able to translate to and from various machine architectures. But all of these examples used software to translate from one instruction set to another. Soft Machines is doing it entirely in hardware. What these examples also have in common is a very large and well-funded development team with a pressing commercial reason to make it work. What chance does a startup have?
Well, Soft Machines isn’t really a startup. Not in the usual sense, anyway. The company is actually seven years old, and it has been working on nothing but this project the whole time. It’s also accumulated (and spent) $125 million in venture capital. And it currently boasts 250 employees. Hardly two guys in a garage. A couple hundred people with a nine-figure budget can accomplish a lot in seven years.
VISC’s other claim to fame is that it can crack the IPC (instructions per clock) barrier holding back other processors. In other words, VISC can extract more performance out of an existing ARM program than ARM’s own processors can. It does this by finding and exploiting parallelism that other processors leave on the table. It’s a bold claim, and also one that other vendors have made.
Everybody wants to parallelize programs: to break them into separate threads that can execute independently. But that’s easier said than done, and several factors conspire to make it difficult. For starters, most computer programs simply don’t work in parallel. The task they’re performing is inherently serial, with each step logically following another. For example, if you need to add two numbers together and get the sum before multiplying that by a third number, there’s no point in launching the multiply until the addition is finished. Just as nine women can’t produce a baby in one month, some tasks just can’t be parallelized to save time.
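The add-then-multiply example can be sketched in a few lines of Python. This is a simplified model, not a description of any real chip: it assumes unlimited execution units, one step per operation, and that every result name is written only once. The function and variable names are made up for illustration.

```python
def min_steps(ops):
    """Return the fewest steps needed to run `ops` with unlimited hardware.

    Each op is (dest, sources). An op can start only after every source it
    reads has been produced; program inputs are available at step 0.
    Assumes each dest name is written exactly once.
    """
    ready = {}      # result name -> step at which it becomes available
    longest = 0
    for dest, sources in ops:
        start = max((ready.get(src, 0) for src in sources), default=0)
        ready[dest] = start + 1
        longest = max(longest, ready[dest])
    return longest


# The multiply needs the sum, so it must wait: two steps, no matter what.
dependent_chain = [("sum", ("a", "b")), ("product", ("sum", "c"))]

# Here the two operations share nothing, so they can run side by side.
independent_pair = [("sum", ("a", "b")), ("product", ("c", "d"))]

chain_steps = min_steps(dependent_chain)
pair_steps = min_steps(independent_pair)
```

In this model the dependent chain takes two steps even with infinite hardware, while the independent pair finishes in one. Extra execution units only help when the dependencies allow it.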
Even on tasks that can be threaded, the opportunities are often hidden from view. It’s not always obvious what tasks can be run in parallel, so occasions for multithreading go unexploited simply because nobody identified them. New compilers were supposed to find and exploit these hidden opportunities, but they don’t do much better.
In parallel with all of that effort, chipmakers built hardware to detect and extract small-scale parallelism at run time. If two instructions that don’t seem to rely on each other (that is, that have no interdependencies) are fetched out of cache, the chip can safely execute both at once, assuming it has sufficient hardware resources to do so. Thus, we have many high-end processor chips with multiple adders, multiple shifters, multiple multipliers, and so on. Most of the time, that extra hardware is idle – wasted, in a sense. But it’s standing by in case it’s needed.
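As a rough sketch of that run-time check, the Python below models a simple in-order machine that issues adjacent instructions together only when they have no register dependencies on each other. It is a deliberately naive model under stated assumptions: each instruction is a hypothetical `(dest, sources)` pair, there is no register renaming or out-of-order execution, and every instruction takes one cycle. Real processors are far more sophisticated.

```python
def no_hazard(first, second):
    """True if `second` can safely issue alongside the earlier `first`."""
    dest1, srcs1 = first
    dest2, srcs2 = second
    read_after_write = dest1 in srcs2    # second needs first's result
    write_after_write = dest1 == dest2   # both write the same register
    write_after_read = dest2 in srcs1    # second clobbers first's input
    return not (read_after_write or write_after_write or write_after_read)


def issue_cycles(instrs, width=2):
    """Count cycles when up to `width` adjacent independent instrs issue together."""
    cycles = 0
    i = 0
    while i < len(instrs):
        group = [instrs[i]]
        i += 1
        # Keep adding the next instruction while a slot is free and it is
        # independent of everything already in this cycle's group.
        while (i < len(instrs) and len(group) < width
               and all(no_hazard(g, instrs[i]) for g in group)):
            group.append(instrs[i])
            i += 1
        cycles += 1
    return cycles


code = [
    ("r0", ("r1", "r2")),   # independent of the next instruction
    ("r3", ("r4", "r5")),
    ("r6", ("r0", "r3")),   # needs both results above
    ("r7", ("r6", "r1")),   # needs r6, so it cannot pair with the previous one
]
dual_issue = issue_cycles(code, width=2)
single_issue = issue_cycles(code, width=1)
```

Here the dual-issue machine saves only one cycle out of four, and its second execution slot sits idle for two of its three cycles. That is the "wasted" hardware in miniature.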
Finally, there’s the human factor. Programmers just don’t write parallel programs. Maybe it’s the languages we grew up learning (C, C++, assembly code, whatever), or maybe it’s our brains, but we just don’t seem able to write very much multi-threaded code. And if we can’t do it, what hope do our machines have?
More to the point, if a software compiler can’t do it, how can a chip? After all, compilers have all the time in the world to analyze the source code, examine data structures, model control flow, query the Internet, ask the programmer for hints, and so on. Chips have only a few nanoseconds to work, and a window of only a very few instructions to work with. Hardware has no idea about the big picture, while software can refactor code, break apart functions, or reorder data structures. It’s a lopsided battle.
And yet the hardware is winning. Chips have gotten considerably faster over the years, while compiler technology has remained fairly stagnant in comparison. Our programs get faster because we run them on faster processors, not because we recompiled. Sure, it costs a lot in “wasted” transistors, but transistors are essentially free anyway. (We covered the perverse economics of semiconductor manufacturing back in 2009.) Unless you’re overly concerned about either die area or leakage current, throwing transistors at the problem is the right thing to do.
As racing drivers sometimes say, it ran great right up until it didn’t. Leakage current, die area, and power consumption are now big concerns for a lot of chip designers and system developers. So how do we get that performance boost without all the wasted transistors, duplicate execution units, and on-the-fly parallelism?
Ah, that’s a secret. But Soft Machines has divulged a lot of the details, which we’ll dive into next week.
Another way to boost performance is to reduce the cycle count by extracting the control flow from the source code and storing variables in local dual-port memory, so that variables can be accessed in parallel with one another and with the control state.
Put another way, eliminate the step of compiling to an ISA.