
Intel’s latest version of oneAPI takes advantage of new Intel Xeon improvements, supports AMD and Nvidia

In its quest to make oneAPI a viable alternative to Nvidia’s CUDA for parallel-processing software development, Intel has released the 2023.1 version of its oneAPI tools. Last August in EEJournal, I wrote:

“Nvidia has something that Intel and AMD covet. No, it’s not GPUs. Intel and AMD both make GPUs. However, they don’t have Nvidia’s not-so-secret weapon that’s a close GPU companion: CUDA, the parallel programming language that allows developers to harness GPUs to accelerate general-purpose (non-graphics) algorithms. Since its introduction in 2006, CUDA has become a tremendous and so-far unrivaled competitive advantage for Nvidia because it works with Nvidia GPUs, and only with Nvidia GPUs. Understandably, neither Intel nor AMD plan to let that competitive advantage go unchallenged.”

(See “Intel oneAPI and DPC++: One Programming Language to Rule Them All (CPUs, GPUs, FPGAs, etc)”)

James Reinders published a blog post in April titled “2023.1: Refining Intel® oneAPI 2023 Tools” that describes many of the improvements made in this latest version of the Intel oneAPI tools. Reinders, who retired from and later rejoined Intel, is a co-author of the book “Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL.” According to Reinders’s blog, improvements to the Intel oneAPI toolkit include:

  • Compiler support for automatically enabling bfloat16 (Brain Floating Point, 16 bits) when available. The bfloat16 format was developed by Google Brain, an artificial intelligence research group at Google, and is used in Google Tensor Processing Units (TPUs). It is a truncated, 16-bit version of the 32-bit IEEE 754 single-precision floating-point format: it retains all 8 exponent bits, so it approximates the wide dynamic range of 32-bit floating-point numbers, but it keeps only 8 bits of precision instead of IEEE 754’s 24-bit significand. The format reduces storage requirements and speeds execution for machine-learning (ML) applications, which is why it has become widely used for accelerating ML algorithms. The 4th Gen Intel Xeon processor’s version of Advanced Matrix Extensions (AMX) acceleration directly supports bfloat16 dot-product calculations. (A minimal sketch of the bfloat16 truncation appears after this list.)

  • The 2023.1 oneAPI toolkit update supports new Codeplay oneAPI plugins for Nvidia and AMD GPUs. (Intel acquired Codeplay last year.) The AMD plugin now works with AMD’s ROCm 5.x driver; ROCm is AMD’s own answer to Nvidia’s CUDA. This new plugin support reinforces Intel’s plan to make oneAPI the preferred alternative for heterogeneous, parallel programming. Last October, James Reinders at Intel had this to say about Intel’s acquisition of Codeplay: “The company Codeplay became available, and Intel decided to acquire them. I was thrilled. I’ve worked with the people at Codeplay and have loved working with them. They’ve been working on Nvidia and AMD GPUs for a while, but, as a commercial company, they were always looking for someone to underwrite their work. Will a customer want it? Some of the labs sometimes gave them seed money, but not enough to fully productize their work. I hesitate a little to say, “blank check,” but they essentially now have a blank check from Intel to productize their work, and they don’t need to worry about anyone else paying for it. You should see results from this acquisition later this year. You’ll see their tools integrate with Intel’s releases of SYCL so that SYCL/DPC++ ends up being able to target all GPUs from Intel, Nvidia, and AMD. People in the know could build this sort of software using open-source tools over the last year. But let’s face it, most of us want to be as lazy as we can be. I really like just being able to download a binary with a click, install it, and have it just work, instead of building it from open-source files and reading lots of instructions to turn the files into usable tools.” (See “Intel’s Gamble on oneAPI and DPC++ for Parallel Processing and Heterogeneous Computing: An Interview with Intel’s James Reinders.”)

  • The Intel VTune Profiler can now automatically highlight opportunities to improve performance by exploiting the high-bandwidth memory (HBM) on the Intel Xeon CPU Max Series, which the company introduced early this year. The Xeon CPU Max Series incorporates 64 gigabytes of HBM2e high-bandwidth DRAM in the package, and with this latest version of the oneAPI toolkit, Intel is making it easier for software developers to unlock the performance potential of that HBM2e memory.
  • This latest version of the oneAPI toolkit delivers performance increases for photorealistic ray tracing and path guiding from the Intel Open Path Guiding Library (integrated in Blender and Chaos V-Ray) on 4th Gen Intel Xeon processors.
  • The 2023.1 version of the oneAPI toolkit includes updates for the latest CUDA headers and libraries to help software developers migrate Nvidia’s CUDA code to SYCL, the Khronos Group’s C++-based programming model for heterogeneous, parallel processing. SYCL is the core programming language for the Intel oneAPI toolkit, and Intel’s DPC++ is the company’s own flavor of SYCL. (A short CUDA-to-SYCL sketch appears after this list.)
  • This version of the oneAPI toolkit adds support for Intel Arc GPUs to the Intel Distribution for the GDB debugger on Windows. Last September, Intel acquired ArrayFire, a small team of four engineers who specialize in GPU software development. Intel has a lot riding on its Arc GPU family, and with the departure of the Arc GPU’s chief architect, Raja Koduri, in March of this year, the future of the company’s GPUs has been somewhat cloudy. This latest release of the oneAPI toolkit at least indicates continued support for these GPUs.
  • The Intel MPI Library enhances performance for collectives that use GPU buffers and improves default process pinning on Intel processors with both E-cores and P-cores. P-cores are x86 performance cores; E-cores are smaller, lower-power, lower-performance cores currently found in Intel Core processors and due to arrive in some Intel Xeon processors next year.
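
As noted in the bfloat16 item above, the format is essentially the upper 16 bits of an IEEE 754 single-precision value. The following minimal C++ sketch illustrates that truncation and the reverse widening. It is illustrative only, not Intel’s or Google’s implementation, and it truncates rather than rounds:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // bfloat16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits
    // of an IEEE 754 binary32 value -- that is, the upper 16 bits of its bit pattern.
    static std::uint16_t float_to_bf16(float f) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);
        return static_cast<std::uint16_t>(bits >> 16);  // truncate; real hardware may round
    }

    static float bf16_to_float(std::uint16_t h) {
        std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;  // dropped mantissa bits become zero
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main() {
        float x = 3.14159265f;
        std::uint16_t b = float_to_bf16(x);
        // Same dynamic range as a 32-bit float, but only about 2-3 decimal digits of precision.
        std::printf("%.7f -> 0x%04x -> %.7f\n", x, b, bf16_to_float(b));
        return 0;
    }

Dropping the low 16 bits of the significand is what halves storage relative to 32-bit floats while preserving the same exponent range.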
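
To make the CUDA-to-SYCL item above more concrete, here is a hand-written sketch of a trivial CUDA-style vector addition expressed in SYCL. It is an illustration only, not output from Intel’s migration tooling, and the build line in the comment assumes the Codeplay plugin for Nvidia GPUs mentioned earlier is installed:

    // Rough SYCL equivalent of the CUDA kernel:
    //   __global__ void vadd(const float* a, const float* b, float* c, int n) {
    //       int i = blockIdx.x * blockDim.x + threadIdx.x;
    //       if (i < n) c[i] = a[i] + b[i];
    //   }
    //
    // Example build (assumed setup, with the Codeplay Nvidia plugin installed):
    //   icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda vadd.cpp

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        constexpr int n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        sycl::queue q;  // picks a default device: CPU, Intel GPU, or an Nvidia/AMD GPU via the plugins
        {
            sycl::buffer<float> A(a), B(b), C(c);
            q.submit([&](sycl::handler& h) {
                sycl::accessor ra(A, h, sycl::read_only);
                sycl::accessor rb(B, h, sycl::read_only);
                sycl::accessor wc(C, h, sycl::write_only, sycl::no_init);
                // The kernel body maps almost line for line from the CUDA version;
                // the SYCL runtime supplies the index instead of blockIdx/threadIdx math.
                h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                    wc[i] = ra[i] + rb[i];
                });
            });
        }  // buffers go out of scope here, copying results back into the host vectors

        return (c[0] == 3.0f) ? 0 : 1;
    }

For real codebases, Intel’s DPC++ Compatibility Tool and the open-source SYCLomatic project automate this kind of translation, and the updated CUDA header and library support in the 2023.1 release appears aimed at that migration path.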

With this latest release, Intel continues to put corporate energy and muscle into the oneAPI toolkit’s development, signaling the company’s ongoing commitment to oneAPI. These latest developments, like the improvements that preceded them, underscore Intel’s understanding of how important software tools are for unlocking the performance potential built into its latest silicon offerings.
