“You miss 100% of the shots you don’t take.” – Wayne Gretzky
It’s not as though we didn’t see this coming, but it’s still a bittersweet moment. Intel’s Itanium, which has been on life support for years, was just given its official end-of-life (EOL) papers. Go ahead and engrave the date of January 30, 2020 on its massive silicon tombstone.
It’s easy with hindsight to poke fun at Intel and Hewlett Packard for creating Itanium in the first place. Armchair quarterbacks from every quarter (myself included) called it the “Itanic” and gleefully reported every hit below the waterline leading to its slow-motion sinking.
But you know what? I don’t want to do that. Sure, Itanium was an enormously expensive and embarrassing failure for both companies, but it was also a big, hairy, audacious endeavor. We need more of those. Good for Intel. Good for Hewlett Packard (now HPE). Bummer that your moonshot project didn’t reach its goal. But it didn’t blow up on the launchpad, either, and Intel, HPE, and we all learned something in the process. As the motivational posters say, you’re not failing so long as you’re learning.
Itanium was created for a lot of reasons, chief among them a desire to leapfrog the x86 in performance and sophistication. Even back in the era of Napster and LiveJournal, the x86 architecture was looking mighty tired. Intel and HPE both needed something better to replace it.
Fortunately, there was no shortage of better ideas. RISC, VLIW, massive pipelining, speculative execution, out-of-order dispatch, compiler optimization, big register sets, data preloading, shadow registers, commit buffers, multilevel caches, and all the other tricks of the CPU trade were surfacing around the same time. Pick any three and create your own CPU! It’ll be faster, cooler, and more academically stimulating than anything else out there. In the 1990s, it was hard not to design a new CPU architecture.
One underlying philosophy behind Itanium (and many other CPU children of the ’90s) was that software is smarter than hardware. Seems simple enough. Have you ever seen inside the branch-prediction logic of a modern CPU? We throw hundreds, then thousands, then millions of transistors at the task of flipping a coin. Will this branch be taken or not taken? Circuitry has just a few nanoseconds to decide.
How much simpler it would be to shift that task to the software. Compared to complex hardware in the critical path, a compiler has an infinite amount of time to deliberate. Compilers can see the whole program at once, not just a tiny runtime window. Compilers can take hints provided by the programmer. Compilers and analysis tools can model program flow and locate bottlenecks. And best of all, compilers are easier than hardware to change, improve, and update.
The same theology applies to parallelism. Hardware struggles to eke out a bit of parallelism where it can. But software can see the whole picture. Software can schedule loads, stores, arithmetic operations, branches, and the whole gamut of instructions for optimal performance. Software can sidestep bottlenecks before they even happen. It’s ludicrous to force runtime hardware to thread that needle when the compiler can do so at its leisure.
Yup, that settles it: we’re doing our optimization in software from now on. Pull that stuff out of the hardware’s critical path and fire up the compiler tools. Let me know when that new optimizing compiler is ready.
We’re still waiting.
And therein lies the problem. The compilers for Itanium never got good enough to deliver the leap in performance that we all just knew was there for the taking. C’mon, where’s my factor-of-ten performance jump?
It’s not coming from the compilers, that’s for sure, nor is it lurking in VLIW, EPIC, or superscalar hardware tricks. Yes, compilers have all the time in the world (compared to runtime hardware) to tease out hazards like data dependencies, load/use penalties, branch probabilities, and other details. And yes, the compiler can see more of the program than hardware can. The compiler does know more than the hardware, but it doesn’t know much more.
Some things are unknowable, and compilers aren’t omniscient. Even with the entire program to analyze, most branch prediction comes down to an educated coin toss. Even with all the source code as its disposal, finding parallelism is tricky beyond a small window of instructions, and the hardware can already do that.
The plan to throw the tough problems at the compiler guys was doomed from the start. Itanium’s compiler writers weren’t slacking off or underperforming. They didn’t need just a little more time, or just one more release. They were saddled with impossibly high expectations.
Turns out, that convoluted branch-prediction hardware was already doing about as good as job as it’s possible to do. Sure, you can shift that task from hardware to software, but that’s just an implementation tradeoff. There’s no big gain to be had.
Same goes for wide and deep register sets, or bigger caches, or wide instruction words. Itanium bundled instructions and executed them in parallel where possible, through a combination of compiler directives and runtime hardware. Surely the software-directed parallelism will yield big results? Nope. Itanium’s compilers can format beautifully dense and efficient instruction blocks – but only if the program lends itself to such solutions. Apparently, few real-world programs do.
Data dependencies and load/use penalties are just as hard to predict in software as they are in hardware. Will the next instruction use the data from that previous one? Dunno; depends on the value, which isn’t known until runtime. Can the CPU “hoist” the load from memory to save time? Dunno; it depends on where the data is stored. Some things aren’t knowable until runtime, where hardware knows more than even the smartest compiler.
Itanium is like a rocket-powered Hot Wheels car running on its orange plastic track. It had (sorry, still has) awesome power, vast resources, and elaborate control systems. It’s just criminally hampered by its environment. It can do cool loops if it gets a running start but is otherwise stuck on its narrow track.
To borrow a baseball analogy, if you never swing the bat, you’ll never hit the ball. Sometimes you strike out. There’s no shame in that. If you don’t, you’re not trying hard enough, and Intel and HPE were certainly trying hard with Itanium. So long, Itanium, and good luck to its creators.
Still have over a dozen of these dual core quad processor, huge cache, computational servers in our cluster that are a decade+ old. It was worth updating all them to dual core a few years back, just because for some problems they beat other computational servers in our cluster hands down. Some are Intel built, most are Dell built using the same reference design.
They are hot though, and do spin the power meter.
I remember having to deal with PA-RISC, I didn’t like it. Someone said it only survived because the big cache gave it enough performance. I think those folks moved on to Itanium and it suffered a similar problems – the base architecture was flawed and irrecoverable (according to a compiler writer friend).
Hardware guys can do stuff that is pretty cool and should work, but software engineers writing compilers have a different mindset, and if the two aren’t in sync you are out of luck.
SPARC & Solaris, were infinitely preferable to PA-RISC & HPUX, and I never got close to Itanium in any form – and God knows people like to torture me with stupid processors…
I think the main takeaway is that Intel will keep drinking their Kool-Aid well past it’s sell-buy date, and refuse to acknowledge things are failing. Hard to sell them new solutions if they don’t admit they’re in a hole…
Agreed. Although Itanium lasted ~20 years, I think the outcome was clear after the first ten. The rest was a long, slow, glide path. It’s tough to decide whether to “fish or cut bait” when there’s that much money on the line.
Well the good news is there will be some really cheap 9760 HPE NUMA clusters hitting the surplus fire sales soon, that will be a lot faster than our 9300’s.