Want to Run LLMs on the Edge?

I’ve just heard something that left me flabbergasted. Seriously. I cannot recall the last time my flabber was quite this gasted. All I can say is that if you dare to read this column, your own flabber is in danger of joining mine, so this might be a good time for you to don clothing appropriate to the occasion.

Let’s start with the concept of generative AI (GenAI) models like ChatGPT and Stable Diffusion. The text-based members of this family, like the models behind ChatGPT, are known as large language models (LLMs). LLMs usually run in the cloud; that is, on honking big servers in honking big data centers. Well, suppose I were to tell you that I know of a company that has come up with a way of taking LLMs and running them on low-power processors located at the edge where the “internet rubber” meets the “real-world road”? Even better, suppose I were to tell you that the company in question is making this technology available for us all to use for free? How’s your flabber feeling now?

I was just chatting with Alireza Kenarsari-Anhari, who is the CEO of Picovoice. Based in Canada (there seems to be a heck of a lot of high-technology coming out of Canada these days), the company was founded in 2018. Although this seems like yesterday, it’s a lifetime away in the context of GenAI (remember ChatGPT wasn’t presented to the world until 30 November 2022).

Picovoice started life as a voice AI company with a mission to accelerate the transition of voice AI from running in the cloud to running on edge devices like the Arduino, STM32, and Raspberry Pi Zero.

It turns out that the folks at Picovoice are really, really good at what they do. They originally targeted their solutions at hardware companies, but they quickly discovered that a lot of software companies were also interested in building natural speech capabilities into things like security systems and web browsers. Even NASA is going to use Picovoice technology in its next generation of voice-controlled space applications like spacesuits.

Since the guys and gals at Picovoice wanted to squeeze their technology onto the smallest of processors, they spent a lot of effort figuring out how to implement artificial neural networks (ANNs) very, very efficiently. They also created their own ANN architecture, because even TensorFlow Lite (TFLite) was too big and hairy for what they were doing, and things like TFLite for Microcontrollers weren’t available at that time (that little scamp didn’t see the light of day until 2019). Furthermore, they created their own runtime for running neural networks on any processor known to humankind. This is known as XPU, where the “X” stands in for the first letter of MPU, MCU, GPU, NPU, etc.

Now, this is where things start to get very interesting indeed. It turns out that if you have a small neural network with only a couple of million parameters (weights), then almost every parameter contributes equally to the accuracy of the model, and it doesn’t much matter where the parameter is in the network.

By comparison, once you start working with neural networks like LLMs with hundreds of billions of parameters, then not all parameters are created equal (which makes me want to paraphrase George Orwell by saying: “all parameters are equal, but some are more equal than others”). In this case, we discover that there is a relatively small number of parameters that are extremely important. We can think of these as the “aristocracy” of parameters. If you perturb these parameters even a tiny bit, they have the ability to make your world go pear-shaped.

Then there’s a bigger group we might think of as “middle-class” parameters. Although they’re important, it’s not fatal if you ruffle their feathers a bit. Finally, we meet the largest group of all, the “working class” parameters, which are not particularly important on an individual basis, but they’re useful to have around—otherwise nothing ends up getting done. To put this another way, this last group of parameters are not individually important, but they contribute to the accuracy of the overall model by their sheer number.
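To make the three-tier idea a little more concrete, here is a minimal Python sketch of sorting parameters into an “aristocracy,” a “middle class,” and a “working class.” Everything in it is an assumption for illustration only: the function name `tier_parameters`, the per-parameter importance scores, and the 1% and 19% cut-offs are all made up, and the column does not say how picoLLM actually measures importance.

```python
def tier_parameters(importance, top_frac=0.01, mid_frac=0.19):
    """Split parameter indices into three tiers by importance score.

    importance: one (hypothetical) score per parameter; higher means that
    perturbing the parameter hurts the model more.
    top_frac / mid_frac: assumed tier sizes, chosen only for illustration.
    """
    # Rank parameter indices from most to least important.
    order = sorted(range(len(importance)),
                   key=lambda i: importance[i], reverse=True)
    n_top = max(1, round(len(order) * top_frac))
    n_mid = round(len(order) * mid_frac)
    return {
        "aristocracy": order[:n_top],
        "middle_class": order[n_top:n_top + n_mid],
        "working_class": order[n_top + n_mid:],
    }


# Toy data: 100 parameters with modest scores, plus one that matters a lot.
scores = [float(i % 10) for i in range(100)]
scores[42] = 1000.0

tiers = tier_parameters(scores)
for name, members in tiers.items():
    print(name, len(members))
```

In this toy run, parameter 42 ends up alone at the top, while the bulk of the parameters land in the last tier, which matters mostly by weight of numbers.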

But wait, there’s more, because in addition to having more parameters, LLMs also have more layers. The neural network models we use for things like machine vision have tens of neural layers. By comparison, LLMs have hundreds of layers, but not all layers are of equal significance, and their importance changes depending on the model you are using.

As Alireza told me, “All this got us thinking there should be an algorithm that tells us how to allocate our resources among all these parameters. Almost like a triage.”

After a lot of work, the result is picoLLM, which is an end-to-end local LLM platform that enables enterprises to build AI assistants running on-device, on-premises, and in private clouds without sacrificing accuracy.

Suppose you have a hardware platform with limited resources (1 gigabyte of RAM, say) and an LLM with hundreds of layers and 10 billion parameters. In this case, picoLLM can analyze the LLM’s layers and parameters, determine what’s most important, prune things down, and distribute what’s left across the available hardware resources. All this is extremely fine-grained. Some of the parameters become one bit, some become two bits, some become three bits, and so forth depending on how important they are. In a crunchy nutshell, picoLLM can take a humongous LLM and boil it down into something that will fit into your physical system.
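A quick back-of-the-envelope calculation shows why this triage is needed: 1 gigabyte of RAM is 8 billion bits, so 10 billion parameters get an average of less than one bit each. The Python sketch below does that arithmetic and then runs a simple greedy allocation that hands out bits to the most important groups first. To be clear, this is not picoLLM’s algorithm, which the column does not describe in detail. The function names, the three groups, their importance scores, and the “each extra bit halves the remaining error” model are all assumptions made up for illustration.

```python
def average_bits_available(ram_bytes, n_params):
    """Average number of bits per parameter that fit in the given RAM."""
    return ram_bytes * 8 / n_params


def allocate_bits(groups, budget_bits, max_bits=8):
    """Greedy mixed-precision allocation under a memory budget (toy model).

    groups: list of (name, n_params, importance_per_param) tuples.
    Returns {name: bits}; a group left at 0 bits is effectively pruned.
    """
    bits = {name: 0 for name, _, _ in groups}
    remaining = budget_bits
    while True:
        best = None
        for name, n_params, importance in groups:
            b = bits[name]
            # Skip groups at the cap or too big for what is left.
            if b >= max_bits or n_params > remaining:
                continue
            # Assumed model: each extra bit halves the remaining error, so
            # the gain per budget bit spent is importance / 2**(b + 1).
            gain = importance / 2 ** (b + 1)
            if best is None or gain > best[0]:
                best = (gain, name, n_params)
        if best is None:
            return bits
        _, name, n_params = best
        bits[name] += 1          # one more bit for every parameter in group
        remaining -= n_params


# The column's example: 1 GB of RAM and 10 billion parameters.
avg_bits = average_bits_available(10**9, 10 * 10**9)
print(f"average bits per parameter: {avg_bits}")

# Scaled-down toy with the same ratio: 10,000 parameters, 8,000 bits.
groups = [
    ("aristocracy", 100, 64.0),
    ("middle_class", 1_900, 8.0),
    ("working_class", 8_000, 1.0),
]
plan = allocate_bits(groups, budget_bits=8_000)
used = sum(plan[name] * n for name, n, _ in groups)
print(plan, "bits used:", used)
```

The design choice here is deliberately simple: each step spends the next bit wherever the assumed payoff per bit is highest. That is enough to reproduce the flavor of the idea, with important groups getting more precision and the least important ones getting little or none when the budget is this tight.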

As I mentioned earlier, the folks at Picovoice started out with a mission to accelerate the transition of voice AI from the cloud to edge devices. Now they’ve expanded that mission to accelerate the transition of LLMs from running in the cloud to running on the edge.

Obviously, Picovoice is a for-profit company, so why are the folks at Picovoice making their awesome picoLLM technology available for the rest of us to use for free?

Well, it must be acknowledged that Alireza sounded just a little smug when he told me that the guys and gals at Picovoice are in a lucky position in that their voice products are making money and the voice market is on the rise, so they don’t need to raise money and they don’t need investors.

When they started thinking about the next growth enabler, LLMs were the obvious choice. The chaps and chapesses at Picovoice were already good at making ANNs run efficiently with limited resources on the edge, and they realized that many LLMs need to run locally because of issues like cost, privacy, and latency.

As Alireza says: “Any cloud user we turn into an edge advocate is a win for us in the long term.” He also told me about the new technology they are working on—something that will take picoLLM to the next level—but my lips are sealed and that will be a topic for another day. In the meantime, do you have any thoughts you’d care to share on any of this?