Itanium Deathwatch Finally Over

Intel’s Itanium Receives its Official Death Warrant

“You miss 100% of the shots you don’t take.” – Wayne Gretzky

It’s not as though we didn’t see this coming, but it’s still a bittersweet moment. Intel’s Itanium, which has been on life support for years, was just given its official end-of-life (EOL) papers. Go ahead and engrave the date of January 30, 2020 on its massive silicon tombstone.

It’s easy with hindsight to poke fun at Intel and Hewlett Packard for creating Itanium in the first place. Armchair quarterbacks from every quarter (myself included) called it the “Itanic” and gleefully reported every hit below the waterline leading to its slow-motion sinking.

But you know what? I don’t want to do that. Sure, Itanium was an enormously expensive and embarrassing failure for both companies, but it was also a big, hairy, audacious endeavor. We need more of those. Good for Intel. Good for Hewlett Packard (now HPE). Bummer that your moonshot project didn’t reach its goal. But it didn’t blow up on the launchpad, either, and Intel, HPE, and the rest of us all learned something in the process. As the motivational posters say, you’re not failing so long as you’re learning.

Itanium was created for a lot of reasons, chief among them a desire to leapfrog the x86 in performance and sophistication. Even back in the era of Napster and LiveJournal, the x86 architecture was looking mighty tired. Intel and HPE both needed something better to replace it.

Fortunately, there was no shortage of better ideas. RISC, VLIW, massive pipelining, speculative execution, out-of-order dispatch, compiler optimization, big register sets, data preloading, shadow registers, commit buffers, multilevel caches, and all the other tricks of the CPU trade were surfacing around the same time. Pick any three and create your own CPU! It’ll be faster, cooler, and more academically stimulating than anything else out there. In the 1990s, it was hard not to design a new CPU architecture.

One underlying philosophy behind Itanium (and many other CPU children of the ’90s) was that software is smarter than hardware. Seems simple enough. Have you ever seen inside the branch-prediction logic of a modern CPU? We throw hundreds, then thousands, then millions of transistors at the task of flipping a coin. Will this branch be taken or not taken? Circuitry has just a few nanoseconds to decide.

How much simpler it would be to shift that task to the software. Compared to complex hardware in the critical path, a compiler has an infinite amount of time to deliberate. Compilers can see the whole program at once, not just a tiny runtime window. Compilers can take hints provided by the programmer. Compilers and analysis tools can model program flow and locate bottlenecks. And best of all, compilers are easier than hardware to change, improve, and update.

The same theology applies to parallelism. Hardware struggles to eke out a bit of parallelism where it can. But software can see the whole picture. Software can schedule loads, stores, arithmetic operations, branches, and the whole gamut of instructions for optimal performance. Software can sidestep bottlenecks before they even happen. It’s ludicrous to force runtime hardware to thread that needle when the compiler can do so at its leisure.

Yup, that settles it: we’re doing our optimization in software from now on. Pull that stuff out of the hardware’s critical path and fire up the compiler tools. Let me know when that new optimizing compiler is ready.

We’re still waiting.

And therein lies the problem. The compilers for Itanium never got good enough to deliver the leap in performance that we all just knew was there for the taking. C’mon, where’s my factor-of-ten performance jump?

It’s not coming from the compilers, that’s for sure, nor is it lurking in VLIW, EPIC, or superscalar hardware tricks. Yes, compilers have all the time in the world (compared to runtime hardware) to tease out hazards like data dependencies, load/use penalties, branch probabilities, and other details. And yes, the compiler can see more of the program than hardware can. The compiler does know more than the hardware, but it doesn’t know much more.

Some things are unknowable, and compilers aren’t omniscient. Even with the entire program to analyze, most branch prediction comes down to an educated coin toss. Even with all the source code at its disposal, finding parallelism is tricky beyond a small window of instructions, and the hardware can already do that.

The plan to throw the tough problems at the compiler guys was doomed from the start. Itanium’s compiler writers weren’t slacking off or underperforming. They didn’t need just a little more time, or just one more release. They were saddled with impossibly high expectations.  

Turns out, that convoluted branch-prediction hardware was already doing about as good a job as it’s possible to do. Sure, you can shift that task from hardware to software, but that’s just an implementation tradeoff. There’s no big gain to be had.
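
To make that concrete, here is a minimal sketch in Python. It is not a model of Itanium’s actual predictor. It compares a single fixed guess, of the kind a compiler could bake in ahead of time, against a textbook two-bit saturating counter that adapts at runtime. The function names and the two branch histories are invented for illustration.

```python
def static_accuracy(outcomes, predict_taken):
    """Fraction of branches guessed right by one fixed, compile-time prediction."""
    hits = sum(1 for taken in outcomes if taken == predict_taken)
    return hits / len(outcomes)


def two_bit_accuracy(outcomes, state=2):
    """Fraction guessed right by a two-bit saturating counter.

    States run 0..3; the branch is predicted taken when state >= 2.
    """
    hits = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            hits += 1
        # Nudge the counter toward what actually happened, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return hits / len(outcomes)


# Hypothetical branch histories (True = taken).
loop_branch = ([True] * 9 + [False]) * 100      # loop back-edge: 9 taken, 1 exit
phased_branch = [True] * 500 + [False] * 500    # behavior flips halfway through

for name, history in [("loop", loop_branch), ("phased", phased_branch)]:
    best_static = max(static_accuracy(history, True), static_accuracy(history, False))
    print(name, best_static, two_bit_accuracy(history))
```

On the loop-style branch, both approaches are right 90% of the time, so moving the decision into software buys nothing. On the branch that changes its behavior halfway through, the best fixed guess is stuck at 50%, while the counter mispredicts only twice.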

Same goes for wide and deep register sets, or bigger caches, or wide instruction words. Itanium bundled instructions and executed them in parallel where possible, through a combination of compiler directives and runtime hardware. Surely the software-directed parallelism will yield big results? Nope. Itanium’s compilers can format beautifully dense and efficient instruction blocks – but only if the program lends itself to such solutions. Apparently, few real-world programs do.
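
A toy packer shows why. The Python sketch below is a deliberately simplified model: it assumes three instruction slots per bundle, tracks only read-after-write dependencies, and ignores the real IA-64 template and stop-bit rules. The instruction tuples, register names, and function names are all hypothetical.

```python
from typing import List, Tuple

# (destination register, source registers)
Instr = Tuple[str, Tuple[str, ...]]


def pack_bundles(program: List[Instr], width: int = 3) -> List[List[Instr]]:
    """Greedy, in-order packing into fixed-width bundles.

    A new bundle starts when the current one is full, or when the next
    instruction reads a register written earlier in the current bundle.
    """
    bundles: List[List[Instr]] = []
    current: List[Instr] = []
    written = set()
    for dest, srcs in program:
        depends = any(s in written for s in srcs)
        if current and (len(current) == width or depends):
            bundles.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


def slot_utilization(bundles: List[List[Instr]], width: int = 3) -> float:
    """Share of bundle slots doing useful work; the rest would be padding."""
    if not bundles:
        return 0.0
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# Six operations with no dependencies between them.
independent = [(f"r{i}", (f"a{i}", f"b{i}")) for i in range(1, 7)]
# Six operations where each one needs the previous result.
chain = [("r1", ("a", "b"))] + [(f"r{i}", (f"r{i - 1}", "c")) for i in range(2, 7)]

print(len(pack_bundles(independent)), slot_utilization(pack_bundles(independent)))
print(len(pack_bundles(chain)), slot_utilization(pack_bundles(chain)))
```

In this model, the six independent operations fill two bundles completely. The six chained operations need six bundles, leaving two of every three slots empty, and no amount of compiler cleverness changes that because the dependency is in the program itself.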

Data dependencies and load/use penalties are just as hard to predict in software as they are in hardware. Will the next instruction use the data from that previous one? Dunno; depends on the value, which isn’t known until runtime. Can the CPU “hoist” the load from memory to save time? Dunno; it depends on where the data is stored. Some things aren’t knowable until runtime, where hardware knows more than even the smartest compiler.
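
The hoisting question can be sketched in a few lines of Python. This is an illustration under simple assumptions, not Itanium’s mechanism: memory is a plain list, the addresses are list indexes, and the function names are made up. The checked variant is only loosely in the spirit of a speculative load followed by a runtime check.

```python
def in_order(mem, store_addr, value, load_addr):
    """Program order: do the store, then the load."""
    mem = list(mem)                 # work on a copy
    mem[store_addr] = value
    return mem[load_addr]


def hoisted_unchecked(mem, store_addr, value, load_addr):
    """Load moved above the store with no check: wrong if the addresses match."""
    mem = list(mem)
    loaded = mem[load_addr]         # issued early to hide memory latency
    mem[store_addr] = value
    return loaded


def hoisted_checked(mem, store_addr, value, load_addr):
    """Load moved above the store, then redone if the store hit the same address."""
    mem = list(mem)
    loaded = mem[load_addr]
    mem[store_addr] = value
    if store_addr == load_addr:     # only knowable once the addresses exist
        loaded = mem[load_addr]     # recovery: repeat the load
    return loaded


memory = [10, 20, 30]
# Different addresses: hoisting is harmless.
print(in_order(memory, 1, 99, 2), hoisted_unchecked(memory, 1, 99, 2))
# Same address: the unchecked hoist returns stale data.
print(in_order(memory, 1, 99, 1), hoisted_unchecked(memory, 1, 99, 1),
      hoisted_checked(memory, 1, 99, 1))
```

When the two addresses differ, the early load returns the same value as the in-order version. When they coincide, the unchecked hoist returns the stale value, and only the runtime comparison rescues it. The compiler cannot make that comparison ahead of time unless it can prove where both pointers point.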

Itanium is like a rocket-powered Hot Wheels car running on its orange plastic track. It had (sorry, still has) awesome power, vast resources, and elaborate control systems. It’s just criminally hampered by its environment. It can do cool loops if it gets a running start but is otherwise stuck on its narrow track.

To borrow a baseball analogy, if you never swing the bat, you’ll never hit the ball. Sometimes you strike out. There’s no shame in that. If you never strike out, you’re not trying hard enough, and Intel and HPE were certainly trying hard with Itanium. So long, Itanium, and good luck to its creators.

4 thoughts on “Itanium Deathwatch Finally Over”

  1. Still have over a dozen of these dual core quad processor, huge cache, computational servers in our cluster that are a decade+ old. It was worth updating all of them to dual core a few years back, just because for some problems they beat other computational servers in our cluster hands down. Some are Intel built, most are Dell built using the same reference design.

    They are hot though, and do spin the power meter.

  2. I remember having to deal with PA-RISC, and I didn’t like it. Someone said it only survived because the big cache gave it enough performance. I think those folks moved on to Itanium and it suffered similar problems – the base architecture was flawed and irrecoverable (according to a compiler writer friend).

    Hardware guys can do stuff that is pretty cool and should work, but software engineers writing compilers have a different mindset, and if the two aren’t in sync you are out of luck.

    SPARC & Solaris were infinitely preferable to PA-RISC & HP-UX, and I never got close to Itanium in any form – and God knows people like to torture me with stupid processors…

    I think the main takeaway is that Intel will keep drinking their Kool-Aid well past its sell-by date, and refuse to acknowledge things are failing. Hard to sell them new solutions if they don’t admit they’re in a hole…

    1. Agreed. Although Itanium lasted ~20 years, I think the outcome was clear after the first ten. The rest was a long, slow, glide path. It’s tough to decide whether to “fish or cut bait” when there’s that much money on the line.

  3. Well the good news is there will be some really cheap 9760 HPE NUMA clusters hitting the surplus fire sales soon, that will be a lot faster than our 9300’s.
