You know who you are. You’re one of the legions of ARM programmers, engineers, and developers. You made ARM the most popular 32-bit processor on the planet—eclipsing even Intel. You use an ARM-based cell phone, you listen to your ARM-based iPod, you spin up ARM-based disk drives… admit it. You’re part of the ARM army.
Well, good news, campers. The latest, greatest, fastest, most wonderful-est ARM processor in the world just got announced today. It’s the tippy-top of ARM’s broad family tree, surpassing even the multicore Cortex-A9. Behold the Cortex-A15. Look upon it and be amazed.
Okay, maybe the A15 isn’t that big a deal. Yes, it’s a sophisticated and advanced 32-bit processor design, and it’s clearly the best work that ARM has ever done. But to be honest… it’s a lot like other 32-bit designs from other CPU companies. The big deal is that it’s the most-advanced CPU from ARM. It’s just not the most-advanced CPU ever.
What’s She Got Under the Hood?
By any measure, the Cortex-A15 (which was code-named Eagle in development) is an impressive piece of work. It’s a multicore, superscalar, out-of-order, 32-bit machine with extensive branch prediction, virtualization, register renaming, parallel execution units, and all the other bells and whistles you could want. When the first A15-based chips hit the street next year, they should hum along at about 1.5 GHz and may hit 2.5 GHz with a bit of a tailwind. The A15 can easily support up to eight CPU cores, and ARM hints that support for 32 cores and more might be just around the corner. Clearly, this is a big processor for big tasks.
This is not your father’s low-power ARM processor. Cortex-A15 isn’t for cell phones or iPads. It’s intended to take on big server chips, network chips, and communications processors where Freescale, Intel, Cavium, NetLogic, Marvell, and other comms-related companies play today. Forget what you remember about the cute little ARM7. Cortex-A15 moves ARM into the world of big iron.
And that’s really a double-edged sword. The A15 looks like it has the performance wherewithal to duke it out with the big boys from MIPS or PowerPC or Intel. But the A15 also gives up (or at least, downplays) the traditional ARM advantages of low power, small die area, and simple programming. To make a big and powerful processor, ARM had to… make a big and powerful processor. Let me show you what I mean.
For starters, A15 has a massively long 24-stage pipeline. Fully half of that—12 pipeline stages—is just for fetching and decoding instructions. That’s a heck of a long time before instructions even start to execute. The execution units (there are eight of them) take 3–12 additional pipeline stages, for a total of up to 24. There are two “simple” execution units for basic ARM instructions, two for multimedia and floating-point instructions, two for loads and stores, one for multiplication and division, and one execution unit for handling branches. All of these can run in parallel, though it would be an unusual piece of code that kept them all busy at once.
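To keep the numbers straight, here is a minimal Python tally of the figures quoted above. The dictionary keys and function names are made up for illustration; only the counts and stage numbers come from the text.

```python
# Execution units as listed in the text: name -> how many of them.
EXECUTION_UNITS = {
    "simple integer": 2,
    "multimedia / floating point": 2,
    "load/store": 2,
    "multiply/divide": 1,
    "branch": 1,
}

FRONT_END_STAGES = 12      # fetch + decode, per the text
EXECUTE_STAGES = (3, 12)   # shortest and longest execution paths, per the text


def total_units(units):
    """Add up the per-type unit counts."""
    return sum(units.values())


def pipeline_depth_range(front_end, execute_range):
    """Return (shortest, longest) total pipeline depth in stages."""
    shortest, longest = execute_range
    return front_end + shortest, front_end + longest


print(total_units(EXECUTION_UNITS))                             # 8
print(pipeline_depth_range(FRONT_END_STAGES, EXECUTE_STAGES))   # (15, 24)
```

So an instruction passes through somewhere between 15 and 24 stages, depending on which unit handles it.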
The long pipeline is necessary to enable the high clock rates; high frequencies mean short periods, after all. And since cache memories aren’t getting much faster these days, it takes more cycles to fetch code and data from those sluggish SRAMs. The downside of a long pipeline is the penalty you pay every time you have to flush it and start over. In other words, branches derail this high-speed train.
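A back-of-the-envelope model shows how quickly those flushes add up. This is only a sketch: the branch frequency, the misprediction rates, and the 12-cycle flush penalty (roughly the fetch/decode depth) are assumptions chosen for illustration, not figures from ARM.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once pipeline flushes are counted.

    Each mispredicted branch is assumed to cost `flush_penalty` extra cycles.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Hypothetical workload: one instruction in five is a branch, ideal CPI of 1.0.
# The 12-cycle penalty is an assumption, not a published A15 number.
for rate in (0.0, 0.05, 0.20):
    cpi = effective_cpi(1.0, 0.20, rate, 12)
    print(f"mispredict rate {rate:.0%}: CPI ~ {cpi:.2f}")
```

Under these made-up numbers, getting one branch in five wrong costs almost half again as many cycles per instruction, which is why a long pipeline needs good branch prediction.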
Every taken branch that the A15 fails to anticipate forces it to flush and reload its long instruction pipeline. Nothing unusual about that; every pipelined processor pays the same kind of penalty. To help mitigate the problem, ARM built in a branch-target buffer (BTB) to trap and hold the first few instructions from the beginning (target) of the most recently encountered branches. What makes the A15 interesting is its new “micro BTB,” which is managed like a fully associative cache. Through the magic of dynamic branch prediction, the A15 basically guesses whether a branch will be taken or not, then looks up the target of that branch in its micro BTB. Assuming the guess is correct, a nasty pipeline bubble is all but prevented.
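For the curious, the general idea can be sketched in a few lines of Python. This is a toy, not ARM's design: the class name, the four-entry size, the LRU replacement, and the two-bit saturating counters are all assumptions, and it stores only the target address rather than the target's first few instructions.

```python
from collections import OrderedDict


class MicroBTB:
    """Toy fully associative branch-target buffer with LRU replacement."""

    def __init__(self, entries=4):
        self.entries = entries
        # branch address -> [target address, 2-bit saturating counter]
        self.table = OrderedDict()

    def predict(self, pc):
        """Return the predicted target, or None for 'not taken / unknown'."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        target, counter = entry
        return target if counter >= 2 else None

    def update(self, pc, taken, target=None):
        """Train on the real outcome once the branch has resolved."""
        if pc in self.table:
            entry = self.table[pc]
            self.table.move_to_end(pc)            # mark as most recently used
            if taken:
                entry[0] = target
                entry[1] = min(3, entry[1] + 1)
            else:
                entry[1] = max(0, entry[1] - 1)
        elif taken:
            if len(self.table) >= self.entries:
                self.table.popitem(last=False)    # evict least recently used
            self.table[pc] = [target, 2]          # start at "weakly taken"


btb = MicroBTB()
print(btb.predict(0x1000))          # None: this branch has not been seen yet
btb.update(0x1000, taken=True, target=0x2000)
print(hex(btb.predict(0x1000)))     # 0x2000
```

Because any branch can occupy any slot, a lookup is a single match on the branch address. That is what "fully associative" buys you, at the price of a small table.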
I Don’t Want To Be Alone
There’s plenty more going on inside the A15, but, rather than wallow in the nerdy details, let’s turn our attention to its clustering abilities. Like the Cortex-A9, the A15 can be fabricated in clusters of four CPUs. (You can also make one- and two-CPU clusters.) All four CPUs share the same L2 cache and thus maintain cache coherence.
Unlike with the A9, you can combine more than one of these four-way clusters in a single chip. For now, ARM admits the A15 will support two such clusters, for an eight-way processor. Realistically, the limit is probably around 32 cores or so. All the CPUs remain cache coherent with one another through a shared AMBA 4 interface. All the caches are fault-tolerant, too, courtesy of ECC (error checking and correction). The L1 and L2 caches will silently correct single-bit errors or squeal and complain if they detect two-bit errors. ECC is important if you’re making servers that are up and running 24/7 and might occasionally (in fact, will probably) encounter the sporadic “soft” error in RAM.
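The "correct one bit, detect two" behavior is what a SECDED (single-error-correct, double-error-detect) code provides. Here is a minimal sketch using an extended Hamming code on four data bits. The text doesn't say which code the A15's caches use, and real hardware protects much wider words, so treat the function names and the 8-bit layout as illustrative assumptions.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (a list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # make the whole word even parity
    return code


def decode(code):
    """Return (status, nibble): 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    odd_parity = sum(code) % 2 == 1
    if odd_parity:
        code[syndrome] ^= 1              # single-bit error: syndrome is its index
        status = "corrected"
    elif syndrome != 0:
        return "uncorrectable", None     # parity looks fine, syndrome doesn't: two bits flipped
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                             # simulate one soft error
print(decode(word))                      # ('corrected', 11)
word[2] ^= 1                             # a second flipped bit
print(decode(word))                      # ('uncorrectable', None)
```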
There’s still more to the Cortex-A15, too. There’s the register renaming that enables aggressive out-of-order execution. There’s the new privilege level that helps with virtualization. There’s the 40-bit physical addressing that allows software to access 1 TB of memory. The list goes on and on.
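The 1 TB figure falls straight out of the arithmetic. A quick check, assuming byte-addressable memory and binary units (the helper name is just for illustration):

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with this many bits."""
    return 1 << address_bits


GIB = 1 << 30
TIB = 1 << 40

print(addressable_bytes(32) // GIB)                      # 4   (4 GB with 32 bits)
print(addressable_bytes(40) // TIB)                      # 1   (1 TB with 40 bits)
print(addressable_bytes(40) // addressable_bytes(32))    # 256 times more memory
```

Eight extra address bits multiply the reachable physical memory by 256.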
In short, the A15 is a big-boy, grownup processor. It’s got the whole checklist of high-performance features that a MIPS 1074K, PowerPC e600, or Intel Xeon has. It’s a he-man processor buffed and ready for some heavy lifting.
What it’s not is your traditional small and light ARM processor. ARM isn’t revealing the A15’s power numbers or die area yet, but I’m willing to bet it’s big and it’s hot. You can’t run a processor this complex without burning a lot of watts. ARM can’t sprinkle any magic pixie dust on its CPUs; they’re governed by the same laws of physics as everyone else’s. The company earned its low-power reputation by designing CPUs that were less complex than anyone else’s. They weren’t magic; they were simple. With the A15, the company joins the ranks of the other high-end processor vendors, watts and all.
It’s as though ARM has reached puberty. The company has grown up and earned its place at the adult table with the other grownups. But in so doing, it’s lost some of its youthful charm. The company that wrote the book on low-power licensed processor designs has steadily outgrown the characteristics that made it so appealing. It’s filled out and become an awkward teen, not sure whether it should be playing with toys or applying for its first job. Welcome to the complex world of adulthood.