Centaur Produces Nine-Headed, Two-Species Monster

“A great man is… an inspiration and a prophecy.” – Robert Green Ingersoll

If a centaur is supposed to be a combination of two separate beasts, then nothing could make a more appropriate mascot for Centaur Technology’s latest microprocessor. The tiny Austin-based company has created a dual-DNA, nine-headed creature that runs both x86 server code and hugely accelerated machine-learning (ML) processes. It’s a match made in mythology.

The new chip is called simply CHA (which doesn’t stand for anything), and it’s probably due out sometime late this year. We’re not sure, because sightings of the elusive beast are rare, although confirmed. It’s real; it’s just not reproducing yet in any great numbers.

What CHA does, exactly, would confuse even Janus. On one side, it’s a fairly conventional x86 server processor with eight CPU cores running at 2.5 GHz. But on the other, it’s a massive 20-trillion-operations-per-second (20 TOPS) machine-learning processor that can outrun far more expensive hardware on MLPerf benchmarks. Creaky old CISC architecture, meet your new workload.

For those still brushing up on their microprocessor mythology, Centaur Technology has been around longer than many of us. The company was formed back in 1995 during the x86 clone wars. Like NexGen, Rise, Cyrix, Transmeta, and others too numerous to mention, Centaur thought it could produce clean-room x86 designs that were fully compatible with Intel’s (and AMD’s) processors, hoping they could chip off a tiny sliver of the obscenely profitable PC processor market. The difference is, Centaur was right. Its chips worked as advertised, and the company has been cranking out newer and better x86 chips for 25 years. For most of that time, the company has been wholly owned by Via Technology in Taiwan, but it effectively still runs as a small and independent startup.

Now the company has taken on a new challenge: accelerating ML inference code. It’s still early days for ML hardware – not unlike the x86 market in the mid-90s – so there’s no consensus on how (or why) to build ML processors. Centaur chose a dual-pronged approach by combining the familiar x86 architecture with a completely new and original SIMD architecture boasting 128-bit instructions and massive 4096-byte (not bits) data paths. It’s two processors in one, intended for enterprise-level installations that are space- and cost-constrained. It’s neither the fastest x86 processor you can buy nor the fastest ML accelerator, but it is apparently the only device to combine the two.

First, the x86 part. This half of the chip (which, in reality, consumes about two-thirds of the CHA chip’s silicon die area) sports eight identical x86 processor cores, each with their own private L1 and L2 caches, and 16MB of shared L3 cache. All eight cores support Intel’s AVX-512 extensions for vector operations. The chip also includes north/southbridge features like four DDR4 channels, 44 PCIe 3.0 lanes, the usual assortment of miscellaneous PC I/O, and a port to a second CHA chip socket in case you want to build a two-chip system.

Next, the ML accelerator. Centaur calls it Ncore, and it’s a clean-sheet design organized as a long SIMD pipeline with a limited instruction set and very wide registers, data buses, and memories. Designers may argue over the best way to design an inference engine, but everyone agrees on one thing: machine learning consumes huge gobs of data.

ML workloads are broadly like those in signal processing or graphics, in that they’re (a) repetitive, (b) highly parallel, and (c) require lots of data transfers. In fact, feeding data to an inference engine is one of the toughest parts, just as it is with many DSP or GPU tasks.

Unlike a “normal” RISC or CISC processor, feeding the Ncore pipeline with instructions isn’t all that hard because the code is highly iterative (i.e., it’s mostly loops). In fact, Ncore’s code memory is only 512 words long, and even that is divided into halves, with Ncore executing out of one half while the programmer (that’s you) loads the other half. Its data memory, on the other hand, spans 16MB and consumes about two-thirds of Ncore’s total area.

That data memory is mapped onto the chip’s internal PCI bus during enumeration, so it appears to the x86 processors as a single 16MB block of RAM. All Ncore programming is done through this memory-mapped window; you never access Ncore directly.

Like many DSPs and GPUs, Ncore is highly parallel and optimized for repetitive MAC (multiply-accumulate) operations working on a steady flow of new data. It can work on 8-bit integers, 16-bit integers, or BFloat16, a popular format for machine learning that has a wider dynamic range than the “official” IEEE-754 standard FP format.

Because there’s so much data manipulation in ML workloads, Centaur built special-purpose hardware to rotate, shift, and otherwise massage data as it flows into and out of Ncore’s memory. Certain functions, like bit-wise rotation or replicating bytes to fill out a vector, are done on the fly without explicit coding. Output activation functions like rectified linear unit (ReLU), sigmoid, and hyperbolic tangent are also done by dedicated hardware and don’t take any time.

Based on published MLPerf benchmark results, Centaur’s CHA kicks some AI butt. It runs circles around Intel’s best Xeon processors, and it beats out chips from nVidia, Qualcomm, and cloud-based services from Google and Alibaba. It’s not the fastest overall, but Centaur claims it’ll offer the best price/performance. A CHA-based system will also be smaller, since other x86 solutions require external accelerator cards and/or FPGA add-ons. Nobody else offers an x86 processor and ML accelerator all in one package. That might be a big deal to customers creating space-constrained 1U rackmount systems.

Besides, there’s nothing to prevent you from adding off-chip hardware alongside CHA, since the chip behaves like any other x86 processor. The Ncore coprocessor inside doesn’t preclude the use of additional hardware acceleration outside.

Downsides? There may be a few. If you look at it as just an x86 server processor, CHA is woefully out of date. Its eight cores are all single-threaded, compared to the dual-threaded processors from Intel and AMD. Its 2.5GHz clock rate is second-rate, too. Centaur describes its x86 design as a cross between Haswell and Skylake, which is pretty accurate in terms of ISA features, but Haswell is seven years old and Skylake is five years old. By the time CHA starts shipping late this year, Intel will have moved on to Sunny Cove, putting CHA yet another generation behind the state of the art. Designing a multicore x86 processor from scratch with a small team of engineers is a remarkable achievement, but even so, this one’s not close to current performance norms. CHA’s value is in the ML accelerator, not the eight-core x86 chip.

Ancient centaurs were supposed to be gifted in prophecy. Maybe the modern Centaur, with its hybrid CHA, sees something coming that its competitors don’t – yet.

Centaur Produces Nine-Headed, Two-Species Monster

Related

Leave a Reply Cancel reply

featured paper

How Google and Intel use Calibre DesignEnhancer to reduce IR drop and improve reliability

featured chalk talk