“And now for something completely different.” – John Cleese
In its latest bid for world domination, ARM has announced two new high-end Cortex processor designs. One is a nice, but unremarkable, tweak on the current Cortex-A77 core. The other is something different: an unabashed speed freak that drops any pretense of power efficiency in pursuit of performance. Say hello to the Cortex-A78 and Cortex-X1, respectively.
ARM long ago developed a reputation for making power-efficient processors that struck a balance between performance output and energy consumption. Personally, I think that ARM’s low-power reputation was undeserved. There’s nothing magical about ARM’s CPU microarchitecture, circuit design, or instruction set that makes it unusually efficient, and other CPU manufacturers have matched or beaten ARM’s MIPS-per-watt metrics. The laws of physics are the same for everyone. Still, ARM’s brand positioning has served it well, to the point where the company has a near-monopoly on CPUs for portable battery-powered devices that cost more than a few bucks.
But, like a young actress who wants to branch out into edgier roles, ARM has been held back by its low-power image. The company shares its CPU designs among multiple licensees, so they need to be somewhat generic, portable, and broadly appealing. That limits the kinds of tweaks ARM can apply. Turn the dial too far toward aggressive performance, and some licensees will shriek and clutch at their pearls, fearful for the health of their batteries. Turn it too far the other way and you get a slow, boring CPU.
Now that virtually every SoC designer that can fog a mirror has already licensed some version of ARM’s processors, there was bound to be conflict. After all, if everyone is using the same Cortex core, how can they differentiate? Certainly not in the hardware.
Until now, the only recourse available to licensees was the rare and hideously expensive “architectural license” from ARM. This allowed large companies (e.g., Apple) to design their own ARM-compatible CPU cores any way they liked – just so long as it remained 100% software-compatible with the rest of ARM’s standard cores. Architectural licenses allow well-funded companies to make ARM cores exactly the way they want them: fast, slow, big, small, efficient, whatever. Plus, you don’t have to share your proprietary design with anyone else.
Enter Cortex-X1, ARM’s first semi-custom CPU design. It joins ARM’s public product catalog, but it’s less generic and likely to be less popular than its standard A-series cores. ARM says it designed the X1 in collaboration with some unnamed customers who wanted more performance and were willing to sacrifice power and silicon to get it. The X1 is Cortex with the speed governor removed.
The X1 is just the first of a planned series of “performance-first” CPU designs. Like Mercedes-Benz’s AMG series of cars or Honda’s red-badged R-spec models, Cortex-X is ARM’s performance sub-brand. Interested licensees first need to join (and pay for) ARM’s Cortex-X Custom (CXC) Program. That opens the door to collaborating with ARM to specify performance-oriented versions of existing Cortex-A cores with the design tradeoffs you want. More execution units? Bigger caches? Increased bandwidth? They’re all on the table.
Unlike with an architectural license, ARM still does all the design work. Licensees get to have input, but they don’t get to (or have to) design their own processor in-house, nor are the X cores proprietary to that licensee. Crucially, X cores remain ARM property and are licensed only to the CXC member(s). The hoi polloi outside the velvet rope don’t get access to them.
In the case of the X1, it encapsulates a number of adjustments to improve integer performance by 30% over the top-of-the-line Cortex-A77, or 22% over the just-announced Cortex-A78 (more on which anon). Naturally, those numbers are estimates based on simulations. ARM is comparing “peak” performance (not sustained) on a single core based on SPECint_base. It’s safe to say that the X1 will be faster than the A77 in most or all cases, but how much faster depends on a lot of factors. YMMV.
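Those two percentages also say something about the A78 itself. As a rough sketch (my arithmetic, not ARM's), dividing one ratio by the other gives the uplift of the A78 over the A77 that the X1 figures imply. It assumes both numbers were measured under the same conditions, which ARM's slides may or may not guarantee; the variable names are purely illustrative.

```python
# Figures quoted by ARM: simulated peak SPECint_base, single core.
x1_over_a77 = 1.30  # X1 is 30% faster than the Cortex-A77
x1_over_a78 = 1.22  # X1 is 22% faster than the Cortex-A78

# Assumption: both ratios share the same process, clock, and cache setup.
# Dividing them then yields the implied A78-over-A77 ratio.
a78_over_a77 = x1_over_a77 / x1_over_a78
implied_gain_pct = (a78_over_a77 - 1) * 100

print(f"Implied A78 uplift over A77: {implied_gain_pct:.1f}%")
```

If the assumption holds, the implied like-for-like gain is in the single digits.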
The X1 gets there, not through any radical redesign of the Cortex microarchitecture, but through a series of small upgrades, adjustments, and tradeoffs. It fetches up to five instructions per cycle, with correspondingly greater fetch bandwidth, compared to four/cycle for the A77/A78. Its L1, L2, and L3 caches are all bigger. (Or, they can be; actual size is configurable.) The L0 branch-target buffer (BTB) is 50% larger than the A78 one, resulting in lighter penalties for mispredicted branches.
Like other high-end ARM cores, the X1 converts native ARM instructions into macro-operations (MOPs), and the X1 can dispatch up to 8 MOPs per cycle, compared to the 6 MOPs/cycle for A77/A78. The MOP cache is also bigger. The X1 is an out-of-order machine like its siblings, but its “window size” for reordering is 40% larger, at 224 entries, which should help it uncover more opportunistic parallelism. Finally, X1 has double the number of Neon vector units (four vs. two), which should make it better at both graphics and machine learning inference.
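To put those widths side by side, here is a small sketch that turns the quoted figures into percentage increases and backs out the reorder-window baseline implied by "40% larger, at 224 entries." The dictionary keys and the helper name are my own, and the baseline is derived from the numbers above rather than taken from an ARM datasheet.

```python
def pct_increase(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100

# (X1 value, A77/A78 value), as quoted in the text.
widths = {
    "fetch (instructions/cycle)": (5, 4),
    "dispatch (MOPs/cycle)": (8, 6),
    "Neon vector units": (4, 2),
}

increases = {name: pct_increase(x1, older) for name, (x1, older) in widths.items()}

# The X1 window is said to be 224 entries and 40% larger than its siblings',
# so the implied sibling window is 224 / 1.40.
x1_window = 224
implied_baseline_window = round(x1_window / 1.40)

for name, pct in increases.items():
    print(f"{name}: +{pct:.0f}%")
print(f"implied baseline reorder window: {implied_baseline_window} entries")
```

None of these widths translates directly into delivered performance; they only show where the extra silicon went.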
After normalizing for fab process, clock frequency, and other factors, but with different cache sizes, ARM says its X1 will outperform the A77 by 30% in both integer and floating-point benchmarks. It also does 19% better in the Octane JavaScript benchmark. Those are simulations, of course, and the integer number is based on the SPECspeed integer test, a part of the larger SPEC CPU benchmark suite.
Almost all the performance-enhancing upgrades to the X1 are area-related. Bigger caches, larger BTB, wider buses, more execution units – these all take up space. And that’s kind of the point. With the area restraints loosened, ARM was able to unlock some more performance, at the cost of silicon area. In this sense, the X1 is just like any other CPU, regardless of processor family or architecture. As the old racer’s adage goes, speed costs money. How fast do you want to go?
ARM says the X1 is about 50% bigger than an A77, though much of that difference is the extra 512KB of L2 cache. The X1 also plays nicely within a cluster of other Cortex-A processors. Comparing apples to oranges for a moment, a cluster of X1, A78, and A55 cores with 8MB of L3 cache done in a 5nm process will be about 15% larger than a cluster of A77 and A55 cores with 4MB of L3 in 7nm. So now you know.
A 30% (peak, projected) performance improvement for a 15% (projected) cost in space seems like a good tradeoff. At least, if you’re not optimizing for battery life. ARM offered no estimates at all regarding Cortex-X1’s power consumption.
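For the curious, the ratio behind that tradeoff is easy to sketch. This is back-of-the-envelope arithmetic, not anything ARM published: it divides the projected speedup by the projected area growth, first with the 15% cluster figure and then with the 50% core-only figure quoted earlier. The variable names are mine, and the calculation assumes (generously) that a peak single-core gain can be set against either area number, even though the cluster comparison spans different process nodes and L3 sizes.

```python
perf_ratio = 1.30          # X1 vs. A77, peak single-core, projected
cluster_area_ratio = 1.15  # X1/A78/A55 cluster (5nm) vs. A77/A55 cluster (7nm)
core_area_ratio = 1.50     # X1 core vs. A77 core, per ARM's "about 50% bigger"

# Performance per unit of silicon, relative to the A77-based baseline.
perf_per_area_cluster = perf_ratio / cluster_area_ratio
perf_per_area_core = perf_ratio / core_area_ratio

print(f"cluster-level perf/area: {perf_per_area_cluster:.2f}x")
print(f"core-level perf/area:    {perf_per_area_core:.2f}x")
```

Which denominator you pick changes the verdict: by the cluster figure the ratio lands above 1.0, while the core-only figure puts it below 1.0, which is consistent with a design that spends area freely to buy speed.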
Speaking of the Cortex-A78, that new core was announced alongside Cortex-X1. As expected, it’s an upgrade from the A76 and A77, with some fairly predictable enhancements. Bottom line, the A78 is touted as being 20% faster than its A77 predecessor (simulated, projected, conditions apply, etc.).
Confusingly, ARM says the A78 is speedier than the A77 while remaining “in the same power envelope,” which sounds like a completely fair and rational comparison – until you read the fine print. It assumes the new core uses cutting-edge 5nm process technology, while the A77 is made using 7nm silicon. Oh, and one core is running at 3.0 GHz while the other is at 2.6 GHz, a 15% accelerant for the winner. Those changes alone can account for most of the performance difference. Simply building the older CPU in the newer, faster, and more efficient process will improve its specs, too. Yes, you get more for your watt from the newer CPU design, but ARM can’t really take credit for the work done at TSMC or GlobalFoundries.
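A quick sanity check makes the point. The sketch below strips the clock-speed advantage out of the claimed 20% gain, under the simplifying (and optimistic) assumption that performance scales linearly with frequency. Real workloads that wait on memory scale less than that, so treat the result as a rough lower bound on the design's own contribution rather than a measurement. The variable names are illustrative.

```python
claimed_speedup = 1.20  # A78 vs. A77, per ARM's projection
a78_clock_ghz = 3.0     # A78 in the 5nm comparison
a77_clock_ghz = 2.6     # A77 in the 7nm comparison

# Assumption: performance is proportional to clock frequency.
clock_ratio = a78_clock_ghz / a77_clock_ghz
residual_speedup = claimed_speedup / clock_ratio

print(f"clock advantage:          {(clock_ratio - 1) * 100:.1f}%")
print(f"left over for the design: {(residual_speedup - 1) * 100:.1f}%")
```

Under that assumption, only a few percentage points of the headline figure are left for the microarchitecture once the clock bump is removed.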
Unlike with the X1, the upgrades to the A78 were all done with an eye toward power- and space-efficiency. There are no huge caches, extra execution units, or massive internal buses. Instead, ARM tweaked the A78’s branch predictor to support two taken branches per cycle, which should eliminate a few extra pipeline bubbles. Its designers also made what the company calls a “generational improvement” (i.e., a tweak) to the branch-prediction algorithm. In a similar vein, the company also eked out some improvements to the instruction schedulers, register renaming, reorder buffers, and other hidden arcana. None of these affect silicon area or power consumption by any appreciable measure.
The load/store path has been upgraded, too, with an additional address-generation unit (AGU), double the store bandwidth, and twice the bandwidth to the L2 cache. The caches themselves can optionally be smaller than before, which some customers requested.
Overall, the Cortex-A78 is certainly an improvement over the A77, but not a big one. ARM is overplaying its hand a bit by conflating changes in process technology and clock speed with the enhancements to the CPU. Painting stripes on the hood doesn’t turn it into a race car.
The Cortex-X1, on the other hand, is a bigger deal. It represents a change in business strategy for ARM, albeit a small one. Heaven knows, the company has a lot of different CPU cores in its catalog already, plus some limited configurability (like cache sizes) on most of them. You’d think that would be enough to satisfy everyone, but apparently not. The X1 and its successors allow ARM to break out of its self-imposed box of providing only power-efficient processors. Now the company has given itself permission to make faster, hotter, and bigger CPUs that don’t quite fit the traditional mold. And by branding them differently, ARM preserves its carefully curated image. Cortex-A remains the “good girl” in the family, while X1 is the new “bad boy” that provides a little bit of excitement and recklessness. But only a little bit.
Do they really expect to dig themselves out of the hole by digging it deeper? Probably not. They are just counting on people INTUITIVELY believing that, in the end, more complexity is what sells.
AND ALSO FORGETTING THAT THE CONTRAPTION WILL SOMEDAY HAVE TO BE VERIFIED.
There are already three levels of cache, branch prediction, out-of-order execution, an obscene number of pipeline stages, multiple instruction fetches, etc., none of which has been quantified, just sold on intuitive speculation. HOGWASH!
Not sure what you’re ranting about. ARM’s latest CPU designs aren’t particularly complex compared to those from other CPU companies. Yes, this stuff is complicated but hardly impossible to verify. That’s why it’s called engineering. Complexity is part of the game. Otherwise we’d all still be using 8-bit 4004 processors. Or individual flip-flops. What alternative do you suggest?
Also, what hole?
Since a lowly FPGA accelerator can beat the pants off a CPU running at 10x the clock speed, something can be said for heterogeneous processing. Compilers use abstract syntax trees, so the flow is based on binary expressions, and control flow becomes either do the next in sequence or do the target.
True dual-port memory blocks can read two operands in parallel, and the next sequential can be read along with the address of the non-sequential. So what? It is pure nonsense to rely on caches and main-memory allocation when there is so much memory just lying around on an FPGA. Not to worry, the FPGA is just another chip, so the same memory can be on an ASIC.
Read the addresses of the first two operands, read the two operands and the opcode, then do the opcode while fetching the next operand and opcode.
Similarly, fetch the next sequential and the address of the non-sequential on the same cycle.
And the “hole” is that you cannot count on probabilities that good things happen 95% of the time, like the next operand or operator being in cache 95% of the time. Even so, you can only read one address at a time.
“Personally, I think that ARM’s low-power reputation was undeserved. There’s nothing magical about ARM’s CPU microarchitecture, circuit design, or instruction set that makes it unusually efficient, and other CPU manufacturers have matched or beaten ARM’s MIPS-per-watt metrics. The laws of physics are the same for everyone.”
-Loved that one!!!