Still trying to juggle those flaming chainsaws? Splendid, because now we’re going to see how it’s done.
Last week we introduced Soft Machines and its VISC processor, a new CPU design that runs native ARM code even though it’s not an ARM processor. Soft Machines says VISC can also be tailored to run x86 code, Java code, or just about anything else the company decides is worthwhile. It’s a tabula rasa microprocessor: able to run just about anything you throw at it.
Its other major trick is that it can extract more single-thread performance out of a given binary program than any other CPU. And do so without expending a horrendous number of transistors or consuming planetary levels of energy. Let’s start with that part.
VISC is a multicore processor, which means it’s got two (or more) identical CPU cores running side by side. No surprises there. In “normal” multicore processors, like we see from Intel, AMD, ARM, MIPS, and others, the two cores operate independently, with one core running one instruction stream, or thread, and the other core(s) running a different thread. If the particular program you’re running can’t be split into threads, one core sits momentarily idle. But the one-to-one correlation between threads and cores is hardwired into most multicore processors.
VISC works a bit differently. Although there are two identical cores (in the initial implementation; future versions will have four cores), there’s no one-to-one correlation between threads and cores. One thread might run on both cores, “borrowing” resources, such as an adder or a multiplier, from its neighboring core. This allows a complex thread to briefly spread itself across all the resources the chip can offer, in order to execute in the minimum amount of time.
The leftover resources from the second core needn’t go to waste, either. If one thread uses, say, one-and-a-half cores (as in the above example), the remaining “half core” can execute another thread. This permits a much more fine-grained use of resources than other multicore processors offer, which means more of the chip’s energy goes into useful work and less into sitting idle.
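If it helps to see the principle in code, here’s a toy scheduler in Python. It is emphatically not Soft Machines’ actual logic, just an illustration of the idea: the execution units of both cores form one shared pool, and each cycle, whichever threads have ready work claim whatever units happen to be free.

```python
# Toy illustration of VISC-style resource sharing (not the real scheduler):
# execution units from both cores form one pool, and any thread's ready
# operations can claim free units each cycle.

EXECUTION_UNITS = {
    "core0": ["alu0", "alu1", "mul0"],
    "core1": ["alu2", "alu3", "mul1"],
}

def schedule_cycle(ready_ops):
    """Greedily assign ready operations to free execution units.

    ready_ops: list of (thread_id, op_type) tuples, e.g. ("T0", "alu").
    Returns a list of (thread_id, op_type, unit) assignments for this cycle.
    """
    free_units = [u for units in EXECUTION_UNITS.values() for u in units]
    assignments = []
    for thread_id, op_type in ready_ops:
        unit = next((u for u in free_units if u.startswith(op_type)), None)
        if unit is not None:
            free_units.remove(unit)
            assignments.append((thread_id, op_type, unit))
    return assignments

# A complex thread T0 needs more ALUs than one core owns; T1 uses the leftovers.
print(schedule_cycle([("T0", "alu"), ("T0", "alu"), ("T0", "alu"),
                      ("T0", "mul"), ("T1", "alu"), ("T1", "mul")]))
```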
Part of the smarts to enable this lies in the CPU core, and part appears earlier in the pipeline, in a feature that Soft Machines simply calls the Global Front End. The GFE fetches code from VISC’s unified instruction cache and starts the process of looking for parallelism and dependencies. It does this largely by a process of elimination, looking for instructions that are interdependent because they use the same registers, depend upon each other’s output, or reference the same pointers, for example. The idea here is to encapsulate data references within a single core. What you don’t want is code in Core A using the same registers as the thread in Core B.
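Soft Machines hasn’t published the Global Front End’s algorithm, but the grouping idea is easy to sketch: instructions that touch the same registers land in the same group, and each group is dispatched to a single core so its data references stay local. Here’s a simplified Python illustration (it ignores groups that become connected later, among many other real-world details):

```python
# Hypothetical sketch of dependency grouping: instructions that share
# registers end up in the same group, so each group can be dispatched
# to a single core. Illustration only, not Soft Machines' algorithm.

def group_by_dependency(instructions):
    """instructions: list of (dest_reg, src_regs) tuples, in program order.
    Returns a list of groups, each a list of instruction indices."""
    groups = []   # each group: {"regs": set of registers, "idx": [indices]}
    for i, (dest, srcs) in enumerate(instructions):
        regs = {dest, *srcs}
        # find an existing group this instruction shares registers with
        hit = next((g for g in groups if g["regs"] & regs), None)
        if hit is None:
            groups.append({"regs": regs, "idx": [i]})
        else:
            hit["regs"] |= regs
            hit["idx"].append(i)
    return [g["idx"] for g in groups]

# r0..r2 form one dependency chain, r5..r7 another; each chain can
# be sent to its own core without stepping on the other's registers.
program = [("r1", ("r0",)), ("r2", ("r1",)), ("r6", ("r5",)), ("r7", ("r6",))]
print(group_by_dependency(program))   # [[0, 1], [2, 3]]
```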
Microprocessor aficionados will be familiar with the concepts of register renaming and out-of-order execution. VISC does both. Although the chip has a set of software-visible registers (namely, the ARM register set in the initial implementation), it really has a completely different internal register set. Each core gets its own registers, and the Global Front End handles the register renaming as each thread makes its way in/out of the core.
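For readers who want the nickel tour of renaming, here’s a minimal Python sketch. The mechanics below are textbook register renaming, not VISC’s undisclosed internal register file: each write to an architectural register gets a fresh physical register, so independent uses of the same name no longer collide.

```python
# Minimal register-renaming sketch: architectural (software-visible) registers
# are mapped onto a larger pool of physical registers. Illustrative only.

class Renamer:
    def __init__(self, arch_regs, num_physical):
        self.free = list(range(num_physical))
        # give every architectural register an initial physical home
        self.map = {a: self.free.pop(0) for a in arch_regs}

    def rename(self, dest, srcs):
        """Return (physical dest, physical sources) for one instruction."""
        phys_srcs = [self.map[s] for s in srcs]   # read current mappings
        phys_dest = self.free.pop(0)              # fresh register for the result
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs

r = Renamer(arch_regs=["r0", "r1"], num_physical=64)
print(r.rename("r1", ["r0"]))   # e.g. (2, [0])
print(r.rename("r1", ["r1"]))   # e.g. (3, [2]): a second write gets its own register
```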
Once a batch of interdependent instructions is dispatched to one core or the other, the core itself works on reordering them. This low-level reordering is unique among current processors. Normally, that work is all done up front, and the cores simply do what they’re told. VISC has distributed that intelligence, allowing the Global Front End to make the first-level decisions about dependencies and threads, while delegating the reordering of operations to the cores themselves.
Because instruction reordering implies speculative execution, VISC must withhold posting the results of any instruction until the results of all the previous instructions are resolved. This is particularly important following a branch: if the branch was mispredicted, all of the speculative operations beyond it must be abandoned while execution resumes at the correct branch target. In this case, VISC isn’t much different from other speculative processors (think x86). There’s just no way around branches.
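In miniature, the commit discipline looks something like this toy reorder buffer, written in Python. It illustrates the general technique, not VISC’s actual machinery: results are computed speculatively but only become visible in program order, and anything younger than a mispredicted branch gets thrown away.

```python
# Toy reorder buffer: results are "committed" (made architecturally visible)
# in program order, once every older instruction, including branches, has
# resolved. On a mispredicted branch, younger speculative results are dropped.

def commit_in_order(rob):
    """rob: list of dicts in program order, each with 'result' and,
    for branches, 'mispredicted'. Returns the committed results."""
    committed = []
    for entry in rob:
        if entry.get("mispredicted"):
            break                      # flush everything younger than the branch
        committed.append(entry["result"])
    return committed

rob = [
    {"result": "r1=5"},
    {"result": "branch taken", "mispredicted": True},
    {"result": "r2=9"},                # speculative work down the wrong path
]
print(commit_in_order(rob))            # ['r1=5'] -- r2=9 never becomes visible
```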
So where does the ARM emulation come in? That also happens in the Global Front End, where it adds a couple of stages to VISC’s 11-stage pipeline. All translation is done automatically; there’s no special compiler, preprocessor, or emulator code required. This is unlike what Apple and other computer companies did in the 1990s when they translated binary code on the fly, and more akin to Transmeta, Intel, or AMD. The x86 instruction set is notoriously baroque and intricate, so breaking it down into more-digestible micro-operations made sense for the x86 vendors. But ARM’s instruction set is pretty straightforward, as these things go. Whatever VISC’s internal instruction set looks like (the company isn’t saying), it’s probably not a tough job to convert from one to the other.
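Since the internal instruction set is undisclosed, any concrete example has to invent one, but the shape of the job is simple enough to sketch: a small, fixed mapping from ARM instructions to one or two internal operations apiece, applied in hardware as code flows through the front end. The internal op names below are made up purely for illustration.

```python
# Purely hypothetical decode table: the company hasn't disclosed its internal
# instruction set, so the "internal ops" below are invented. The point is only
# that a straightforward ARM instruction maps onto one or two internal
# operations in a fixed way, with no software emulator involved.

DECODE = {
    "ADD": ["int_add"],
    "LDR": ["agen", "load"],          # address generation, then the load itself
    "STR": ["agen", "store"],
    "MUL": ["int_mul"],
}

def translate(arm_stream):
    """Expand a list of ARM mnemonics into internal operations."""
    return [op for insn in arm_stream for op in DECODE[insn]]

print(translate(["LDR", "ADD", "STR"]))   # ['agen', 'load', 'int_add', 'agen', 'store']
```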
The hard part is fending off the lawyers. Companies like PicoTurbo, as well as some enterprising university students, have successfully cloned ARM processors before, but they were subsequently litigated out of existence. Soft Machines plans to skirt that legal minefield by licensing VISC only to companies that already have an ARM license. Transmeta, and even AMD, used a similar approach by sticking to foundries covered under Intel’s patent agreements, so there is some legal precedent for the strategy. But that raises the question: why would I license VISC if I’ve already licensed ARM?
Probably for the performance. In dozens of benchmark tests, VISC handily outperformed an ARM Cortex-A15 in every category, often by more than 2x or 3x. That’s double or triple the performance of ARM’s best 32-bit processor, running unmodified ARM binaries. No mean feat, that.
Granted, the benchmarks were run by Soft Machines, but they include such reliable old standbys as SPECint, SPECfp, EEMBC, and other fairly tamper-proof tests of processing prowess. The absolute worst that VISC did was 1.1x the performance of an A15 (i.e., 10% faster). In the best case, it was nearly 7x faster. The miserable Dhrystone benchmark came in at over 4.5x the speed of ARM, and the unweighted average of all the benchmarks hovers around 3x ARM’s average performance. Not bad; not bad at all.
Now for the asterisks. All of these scores are measured in units of performance per MHz, not absolute performance. VISC might be faster than Cortex-A15 in terms of instructions per clock (IPC), but absolute performance is IPC times clock rate, and if your current chip is clocked at several times VISC’s likely frequency (which, as we’ll see, it probably is), it’s still going to be faster than VISC.
VISC may be considerably faster than ARM, or even x86, in terms of “architectural efficiency,” but it has a shorter pipeline than most leading-edge processors, and short pipelines hobble clock rates. Soft Machines isn’t saying how fast VISC will run in real life, but 500 MHz is a good guess, assuming a leading-edge process. That compares to 2.5 GHz for ARM’s Cortex-A15 and as much as 4 GHz for Intel’s 22nm Core i7 (Haswell) chips. If absolute performance is what you crave, VISC probably isn’t going to get you there.
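The arithmetic is easy to run yourself. Absolute throughput is just relative IPC times clock rate; the roughly 3x IPC figure is the benchmark average quoted above, and the 500 MHz clock is our estimate, not a published spec.

```python
# Back-of-the-envelope absolute throughput: relative IPC times clock rate.
# The ~3x IPC figure is the benchmark average quoted above; the 500 MHz
# clock estimate is a guess, not a published spec.

chips = {
    "VISC (estimated)": {"ipc_vs_a15": 3.0, "clock_ghz": 0.5},
    "Cortex-A15":       {"ipc_vs_a15": 1.0, "clock_ghz": 2.5},
}

for name, c in chips.items():
    print(f"{name:17s} relative throughput ~ {c['ipc_vs_a15'] * c['clock_ghz']:.2f}")
# VISC lands around 1.5, the A15 around 2.5: a 3x per-clock advantage
# doesn't survive a 5x clock-rate deficit.
```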
ARM and Intel aren’t stupid; they know that short pipelines are simpler, use less silicon, and consume less power, but that longer pipelines enable faster clock frequencies. For today’s high-end embedded applications, that’s the right decision to make, and it’s why Cortex-A15 licenses are flying off the figurative shelves. Soft Machines chose simplicity (relatively speaking) over absolute performance, at least for now.
But that shouldn’t diminish Soft Machines’ accomplishments, and they are many. Even allowing for a modest “fudge factor” on the benchmarks, VISC soundly trounces its presumed archrival, ARM’s Cortex-A15. Double or triple the A15’s performance at the same clock rate would make any chip designer sit up and take notice.
And that’s not even counting VISC’s expected power savings. Using SPECint as a yardstick, Soft Machines says a two-core VISC can deliver the same performance as the A15 while using just one-third the power. Conversely, a four-core VISC can crank out roughly double the performance at the same power level. Regardless of where the actual numbers fall, the point is clear: VISC gets more performance per watt than even the vaunted ARM family. It looks like someone has beaten ARM at its own game, at least on paper.
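In arithmetic form, taking Soft Machines’ own SPECint figures at face value:

```python
# The power claims as simple arithmetic (figures are Soft Machines' own,
# from its SPECint comparison, so treat them as vendor numbers).
a15_perf, a15_power = 1.0, 1.0            # normalize Cortex-A15 to 1.0 / 1.0

visc2_perf, visc2_power = 1.0, 1.0 / 3    # 2-core VISC: same performance, 1/3 power
visc4_perf, visc4_power = 2.0, 1.0        # 4-core VISC: ~2x performance, same power

for name, perf, power in [("Cortex-A15", a15_perf, a15_power),
                          ("VISC x2", visc2_perf, visc2_power),
                          ("VISC x4", visc4_perf, visc4_power)]:
    print(f"{name:10s} perf/watt = {perf / power:.1f}x the A15 baseline")
# Both VISC points land at roughly 2x to 3x the A15's performance per watt.
```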
So why would Soft Machines paint a target on itself by competing directly with ARM with an ARM-compatible processor? To paraphrase Willie Sutton, that’s where the customers are.
Inventing a new microprocessor is hard enough, but the really hard part is developing software for it. As many large chipmakers discovered to their detriment (Intel, AMD, Freescale, Texas Instruments, IBM, DEC, et al.), it takes a minor miracle for a new processor to develop enough market momentum to become viable. Without a critical mass of software, including compilers, operating systems, middleware, applications, and much more, a processor is just an interesting circuit-design experiment. So by adopting (usurping, even) ARM’s established software base, Soft Machines made its task a whole lot easier. “We can’t give performance with one hand and take it away with the other,” says the company’s CTO. All of that clever circuit design might have gone to waste if they’d given VISC an entirely new instruction set. Better to go with the flow.
As we said last week, you don’t see entirely new microprocessor companies every day. But this one might just be worth watching.
If you were in charge of mergers & acquisitions at your company, would you acquire Soft Machines and its VISC technology?
Why or why not?
Not, and here’s why not:
1) The focus is on computation-intensive algorithms rather than general-purpose code.
2) Excessive memory accesses due to speculative execution and a RISC ISA.
3) Any new CPU should run at the HLL-statement abstraction level rather than an ISA.