Taking the Size and Power of Extreme Edge AI/ML to the Extreme Minimum

Earlier this year, I penned a couple of columns under the umbrella title “Mind-Boggling Neuromorphic Brain Chips.” One of the first comments I received concerning these columns was short, sharp, and sweet, simply reading, “Also, Brain-Boggling.”

Arrrggghhh. How did I miss that? How could I not have used “Brain-Boggling Neuromorphic Brain Chips”? There was much gnashing of teeth and rending of garb that day, let me tell you.

The articles in question (see Part 1 and Part 2) were focused on the folks at BrainChip, whose claim to fame is to be the world’s first commercial producer of neuromorphic IP.

Before we plunge headfirst into the fray with gusto and abandon (and aplomb, of course), let’s first remind ourselves as to what we mean by the “neuromorphic” moniker. Also, as part of setting the scene, let’s remind ourselves that we are focusing our attentions on implementing artificial intelligence (AI) and machine learning (ML) tasks at the extreme edge of the internet. For example, creating intelligent sensors at the point where the “internet rubber” meets the “real-world road.”

Regular artificial neural networks (ANNs) are typically implemented using a humongous quantity of multiply-accumulate (MAC) operations. These are typically used to realize things like convolutional neural networks (CNNs) for working with images and videos, deep neural networks (DNNs) for working with general data, and recurrent neural networks (RNNs) for working with sequential (time-series) data.
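To put a little flesh on the MAC bones, here is a minimal sketch of a single fully connected layer written as explicit multiply-accumulate steps. The function name, the tiny weights, and the inputs are all made up for illustration (real networks perform millions of these operations, and real hardware performs them in parallel).

```python
def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer with explicit MAC operations.

    Returns the output values and the number of MACs performed.
    """
    outputs = []
    mac_count = 0
    for row, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, row):
            acc += x * w  # one multiply-accumulate (MAC)
            mac_count += 1
        outputs.append(acc)
    return outputs, mac_count


# Hypothetical toy values: 3 inputs feeding 2 neurons
inputs = [1.0, 0.0, 2.0]
weights = [[0.5, -1.0, 0.25], [0.0, 2.0, 1.0]]
biases = [0.0, 1.0]

outputs, macs = dense_layer(inputs, weights, biases)
print(outputs, macs)  # [1.0, 3.0] 6
```

Note that the layer performs all six MACs even though several of the operands are 0, which is a point we will return to shortly.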

When it comes to implementing these types of ANN for use at the extreme edge, the least efficient option is to use a regular microcontroller unit (MCU). The next step up is to use a digital signal processor (DSP), which can be simplistically thought of as being an MCU augmented with MAC functionality. One more step up the ladder takes us to an MCU augmented with a neural processing unit (NPU). For simplicity, we can visualize the NPU as being implemented as a huge array of MACs. In this case, the NPU cannot run in standalone mode—instead, it needs the MCU to be running to manage everything, feed it data, and action any results.

Furthermore, regular NPUs are designed to accelerate traditional ANNs, and they rely on conventional digital computing paradigms and synchronized operations. These NPUs process data in a batch mode, performing matrix computations (e.g., matrix multiplication) on large datasets, which can be resource-intensive.

By comparison, “neuromorphic” refers to a type of computing architecture that’s inspired by the structure and functioning of the human brain. It seeks to emulate neural systems by mimicking the way biological neurons and synapses communicate and process information. These systems focus on event-based, asynchronous processing that mimics how neurons fire.

Neuromorphic networks are often referred to as spiking neural networks (SNNs) because they model neural behavior using “spikes” to convey information. Since they perform processing only when changes occur in their input, SNNs dramatically reduce power consumption and latency.
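For anyone who hasn't met a spiking neuron before, the following is a minimal sketch of the textbook leaky integrate-and-fire model. I'm not suggesting this is how any particular neuromorphic chip implements its neurons; the threshold, the leak factor, and the input values are assumptions chosen purely to make the behavior visible.

```python
def lif_neuron(input_currents, threshold=1.0, leak=0.5):
    """Simulate a leaky integrate-and-fire neuron over discrete time steps.

    The membrane potential decays by `leak` each step, accumulates the
    incoming current, and emits a spike (1) when it reaches `threshold`,
    after which it resets to 0.
    """
    potential = 0.0
    spikes = []
    for current in input_currents:
        potential = potential * leak + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after firing
        else:
            spikes.append(0)
    return spikes


# Hypothetical input: quiet, two moderate pulses, quiet, one strong pulse
spikes = lif_neuron([0.0, 0.8, 0.8, 0.0, 0.0, 1.5])
print(spikes)  # [0, 0, 1, 0, 0, 1]
```

The neuron stays silent while its input is quiet and produces output only when enough has changed to push it over the threshold, which is the essence of the event-based approach.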

“What about sparsity?” I hear you cry. That’s a good question. What prompted you to ask it? Could it be that you’ve been reading my earlier columns? One problem with regular ANNs is that they tend to process everything, even things that aren’t worth processing. If you are multiplying two numbers together and one is 0, for example, then you already know that the answer will be 0. In the context of AI/ML inferencing, a 0 will have no effect on the result (and a very low value will have minimal effect on the result). The idea behind sparsity is to weed out any unnecessary operations.

In fact, there are three kinds of sparsity. The first is related to the coefficients (weights) used by the network. A preprocessor can be used to root through the network, detecting any low-value weights (whose effect will be insignificant), setting them to 0, and then pruning any 0 elements from the network. The second type of sparsity is similar, but it relates to the activation functions. Once again, these can be pruned by a preprocessor.
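As a rough illustration of weight sparsity, the sketch below zeroes any weight whose magnitude falls below a threshold and then skips the zeroed weights when computing a layer. The function names, the threshold of 0.05, and the weight values are hypothetical; real pruning tools are considerably more sophisticated and usually retrain the network afterward to recover accuracy.

```python
def prune_weights(weights, threshold):
    """Set any weight whose magnitude is below `threshold` to 0."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def sparsity(weights):
    """Return the fraction of weights that are exactly 0."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


def sparse_layer(inputs, weights):
    """Compute a layer while skipping MACs whose weight is 0."""
    outputs = []
    macs = 0
    for row in weights:
        acc = 0.0
        for x, w in zip(inputs, row):
            if w == 0.0:
                continue  # a 0 weight cannot affect the result
            acc += x * w
            macs += 1
        outputs.append(acc)
    return outputs, macs


# Hypothetical weights: half of them are too small to matter
weights = [[0.5, 0.01, -0.7], [-0.02, 0.3, 0.004]]
pruned = prune_weights(weights, threshold=0.05)
outputs, macs = sparse_layer([1.0, 2.0, 3.0], pruned)

print(pruned)            # [[0.5, 0.0, -0.7], [0.0, 0.3, 0.0]]
print(sparsity(pruned))  # 0.5
print(macs)              # 3
```

In this toy case, pruning halves the number of MACs from six to three.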

The third type of sparsity is data sparsity. Think 0s being fed into the ANN, which blindly computes these nonsensical values (silly ANN). Since the real-world data being fed into the network is being generated in real-time “on the fly,” data sparsity isn’t something that can be handled by a preprocessor.

How sparse can data be? Well, this depends on the application, but data can be pretty darned sparse, let me tell you. Think of a camera pointing at a door in a wall. I wouldn’t be surprised to learn that, in many cases, nothing was happening 99% of the time. Suppose the camera is running at 30 frames per second (fps). A typical CNN will process every pixel in every frame in every second. That’s a lot of computation being performed, and a lot of energy being consumed, to no avail.
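To get a feel for the numbers, here is a back-of-the-envelope sketch comparing a frame-based approach (touch every pixel in every frame) with an event-based approach (touch only the pixels that changed). The 4x4 "camera," the brightness values, and the change threshold are all invented for illustration.

```python
def changed_pixels(prev_frame, frame, threshold=10):
    """Return (x, y, polarity) events for pixels that changed by >= threshold."""
    events = []
    for y, (prev_row, row) in enumerate(zip(prev_frame, frame)):
        for x, (p, c) in enumerate(zip(prev_row, row)):
            if abs(c - p) >= threshold:
                events.append((x, y, 1 if c > p else -1))
    return events


def count_work(frames, threshold=10):
    """Count pixels touched by frame-based vs. event-based processing."""
    dense = 0
    event = 0
    prev = frames[0]
    for frame in frames[1:]:
        dense += sum(len(row) for row in frame)  # every pixel, every frame
        event += len(changed_pixels(prev, frame, threshold))  # changes only
        prev = frame
    return dense, event


# Hypothetical scene: one second at 30 fps in which nothing happens
# until a single pixel brightens in the final frame
static = [[50] * 4 for _ in range(4)]
moved = [row[:] for row in static]
moved[1][2] = 200
frames = [static] * 29 + [moved]

dense, event = count_work(frames)
print(dense, event)  # 464 1
```

Even in this tiny example, the frame-based approach touches 464 pixels where the event-based approach handles a single event. Scale that up to a real sensor watching a door that rarely opens, and the savings become substantial.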

By comparison, a neuromorphic NPU is event-based, which means it does something (on the processing front) only when there’s something to be done. To put this another way, while regular NPUs can typically exploit only weight sparsity, activation sparsity, or both, neuromorphic NPUs can support all three types, thereby dropping their power consumption to the floor.

The reason I’m bubbling over with all this info is that I was just chatting with Steve Brightfield, who is the Chief Marketing Officer (CMO) at BrainChip. The folks at BrainChip are in the business of providing digital neuromorphic processor IP in the form of register transfer level (RTL) that ASIC, ASSP, and SoC developers can incorporate into their designs.

In my previous columns, I waxed eloquent about BrainChip’s Akida fabric, which mimics the working of the human brain to analyze only essential sensor inputs at the point of acquisition, “processing data with unparalleled performance, precision, and reduced power consumption,” as the chaps and chapesses at BrainChip will modestly inform anyone who cannot get out of the way fast enough.

Well, Steve was brimming over with enthusiasm to tell me all about their new Akida Pico ultra-low-power IP core. Since this operates in the microwatt (μW) to milliwatt (mW) range, Akida Pico empowers devices at the extreme edge to perform at their best without sacrificing battery life.

Even better, the Akida Pico can either operate in standalone mode or it can serve as the co-processor to a higher-level processor. In standalone mode, the Akida Pico can operate independently, allowing devices to process audio and vital sign data with minimal power consumption. This is ideal for smart medical devices that monitor vital signs continuously or voice-activated systems that need to respond instantly. By comparison, when used as a co-processor, the Akida Pico can offload demanding AI tasks from the higher-level processor, thereby ensuring that applications run efficiently while conserving energy. This really is the ultimate always-on wake-up core.

Example use cases include medical vitals monitoring and alarms, speech wake-up words for automatic speech recognition (ASR) start-up, and audio noise reduction for outdoor/noisy environments for hearing aids, earbuds, smartphones, and virtual reality/augmented reality (VR/AR) headsets.

How big is this IP? Well, a base configuration without memory will require 150K logic gates and occupy 0.12mm² of die area on a 22nm process. Adding 50KB of SRAM will boost this to 0.18mm² of die area on the same 22nm process. I mean to say, “Seriously?” Less than a fifth of a square millimeter for always-on AI that consumes only microwatts of power? Give me strength!

Do you want to hear something really exciting? You do? Well, do you remember my column, Look at Something, Ask a Question, Hear an Answer: Welcome to the Future? In that column, I discussed how the folks at Zinn Labs had developed an event-based gaze-tracking system for AI-enabled smart frames and mixed-reality systems. As a reminder, look at this video:

As we see (no pun intended), the user looks at something, asks a spoken question, and receives a spoken answer. This system features the GenX320 metavision sensor from Prophesee.

Why do we care about this? Well, the thing is that this sensor is event-based. Steve from BrainChip was chatting with the guys and gals at Prophesee. They told him that they typically need to take the event-based data coming out of their camera and convert it into a frame-based format to be fed to a CNN.

Think about it. The chaps and chapesses at BrainChip typically need to take frame-based data and convert it into events that can be fed to their Akida fabric.

So, rather than going from event-based data (from the camera) to frame-based data, and then from frame-based data back to event-based data (for the Akida processor), the folks from Prophesee and BrainChip can simply feed the event-based data from the camera directly to the event-based Akida processor, thereby cutting latency and power consumption to a minimum.
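The round trip being avoided can be sketched as follows. This is purely illustrative: the (x, y, polarity) event format, the function names, and the 4x3 sensor size are assumptions on my part, not the actual Prophesee or Akida interfaces.

```python
def events_to_frame(events, width, height):
    """Accumulate (x, y, polarity) events into a dense frame."""
    frame = [[0] * width for _ in range(height)]
    for x, y, polarity in events:
        frame[y][x] += polarity
    return frame


def frame_to_events(frame):
    """Scan every pixel of a dense frame and re-emit nonzero ones as events."""
    events = []
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value != 0:
                events.append((x, y, 1 if value > 0 else -1))
    return events


# Hypothetical camera output: two events on a 4x3 sensor
camera_events = [(1, 0, 1), (2, 2, -1)]

# The roundabout path: events -> frame -> events
frame = events_to_frame(camera_events, width=4, height=3)
round_trip = frame_to_events(frame)

# The direct path: hand the camera events straight to the processor
direct = camera_events

print(round_trip == direct)  # True
print(4 * 3, len(direct))    # 12 2
```

In this toy case, the roundabout path allocates and scans all 12 pixels only to end up with the same two events it started with, whereas the direct path handles just the two events.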

My head is still buzzing with ideas pertaining to the applications of—and the implications associated with—Akida’s neuromorphic fabric. What say you? Do you have any thoughts you’d care to share?
