
Enabling Voice Control in EVERYTHING

Since the pernicious punctuation police (those who worship the gods of SEO) persist in prescribing that I can no longer terminate the titles of my columns with my old friend, the exclamation mark, I’m forced to become ever more inventive, like shouting “EVERYTHING” (I’m sorry).

As an aside, whilst writing this column, I started to wonder about the history and origin of the exclamation mark, so I had a quick Google while no one was looking and found the following on the Wikipedia:

[…] Its evolution as a punctuation symbol after the Ancient Era can be traced back to the Middle Ages, when scribes would often add various marks and symbols to manuscripts to indicate changes in tone, pauses, or emphasis. These symbols included the “punctus admirativus,” a symbol that was similar in shape to the modern exclamation mark and was used to indicate admiration, surprise, or other strong emotions.

The modern use of the exclamation mark was supposedly first described in the 14th century by Italian scholar Alpoleio da Urbisaglia. According to 21st-century literary scholar Florence Hazrat, da Urbisaglia “felt very annoyed” that people were reading script with a flat tone, even if it was written to elicit emotions. The exclamation mark was introduced into English printing during this time to show emphasis.

It was later called by many names, including point of admiration (1611), note of exclamation or admiration (1657), sign of admiration or exclamation, exclamation point (1824), and finally, exclamation mark (1839).

So, “da Urbisaglia felt very annoyed,” did he? I can only imagine how annoyed he might have felt if, after taking the time and effort to invent the exclamation mark, he was forbidden from actually using it!!! (Yes, THREE exclamation marks… I’m not proud… or embarrassed.)

Do you remember my 2023 column Is the Future of AI Sparse? That was based on a conversation I had with Sam Fok, who is one of the co-founders of Femtosense.ai. As part of our conversation, we discussed the concept of sparsity and sparse artificial neural networks (ANNs). The idea is to zero out values in the ANN to remove unnecessary parameters without affecting inferencing accuracy.

If the truth be told, sparsity comes in different flavors. We start with the concept of sparse weights, in which sparsely connected models store and compute only those weights that really matter, thereby resulting in a 10X improvement in speed, efficiency, and memory utilization. We also have the concept of sparse activations, which means skipping computation whenever a neuron outputs zero, thereby providing a further 10X improvement in speed and efficiency. Now, my biological neural network may not be as speedy and efficient as once it was, but a quick “back-of-the-envelope” calculation reveals that 10X × 10X = 100X, which is better than a poke in the eye with a sharp stick, as they say.
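To make this a little more concrete, here’s a minimal sketch in plain NumPy (nothing to do with Femtosense’s actual kernels, I hasten to add) showing how skipping zero weights and zero activations multiplies the savings in multiply-accumulate (MAC) operations:

```python
# Toy demonstration of sparse weights x sparse activations (plain NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: 256 inputs -> 256 outputs, with ~90% of the weights pruned to zero.
W = rng.standard_normal((256, 256))
W[rng.random(W.shape) < 0.9] = 0.0          # sparse weights

# Pretend post-activation outputs in which ~90% of neurons produce zero.
x = np.abs(rng.standard_normal(256))
x[rng.random(256) < 0.9] = 0.0              # sparse activations

# Dense path: every multiply-accumulate is performed, zeros and all.
dense_macs = W.size

# Sparse path: only pair up nonzero weights with nonzero activations.
nz_cols = np.flatnonzero(x)                          # skip zero activations
sparse_macs = int(np.count_nonzero(W[:, nz_cols]))   # skip zero weights too

y_dense = W @ x
y_sparse = W[:, nz_cols] @ x[nz_cols]       # same result, far fewer MACs
assert np.allclose(y_dense, y_sparse)

print(f"dense MACs:  {dense_macs}")
print(f"sparse MACs: {sparse_macs}  (~{dense_macs / sparse_macs:.0f}x fewer)")
```

Run this and you’ll see the two savings compound to something in the region of my back-of-the-envelope 100X.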

At the time of my previous column, the guys and gals at Femtosense had only recently received their first silicon in the form of the SPU-001. Created using a 22nm TSMC process, this little rascal is presented in a 1.52mm x 2.2mm chip-scale package (CSP). With 1MB of on-chip SRAM (which is equivalent to 10MB effective memory when running a sparse network), the SPU-001 provides sub-milliwatt inferencing for the always-on processing of speech, audio, and other 1D signal data (like from an accelerometer or similar sensor, for example).

In the early days, the chaps and chapesses at Femtosense were selling the SPU-001 directly to other companies to be included in their artificial intelligence (AI)-enabled products, like hearing aids and headsets. In this case, it was up to the customers to connect the SPU-001 to a host processor.

While this deployment model is great for some companies, others prefer to work at a slightly higher level of abstraction. This is why I was just chatting with Sam again, so he could bring me up to date with the latest and greatest news, which is that the lads and lasses at Femtosense have formed a strategic partnership with the folks at ABOV Semiconductor.

Meet the SPU-001-boosted ADAM-100 (Source: Femtosense)

Their first joint offering, the ADAM-100, features one of ABOV’s 32-bit Arm Cortex-M0+ processors combined with a Femtosense SPU-001 neural processing unit (NPU), co-packaged to provide an incredibly low-power AI MCU device. All the user needs to do is lay the ADAM-100 down on a board and connect a microphone, and they are “off to the races,” as it were.

Well… there is a bit more to it than this (there always is), but you get the idea. The thing to focus on at this point is that the ADAM-100 is a single-chip solution that can provide on-device AI for audio, voice, and other 1D signals. Compared to GPU- or cloud-based AI, on-device AI provides immediate, low-latency responses along with low power consumption, high security, operational stability, and low cost. Some example tasks that can be undertaken by the ADAM-100 are as follows:

  • AI Noise Reduction (AINR): Extracting voices from noisy environments prior to backend processing.
  • Wakeword Detection (WWD): Waking a device upon hearing a specific word (a sketch of such an always-on loop appears after this list).
  • Keyword Spotting (KWS): Allowing devices to be controlled with a specific set of keyword commands.
  • Sentence-Level Understanding (SLU): Capturing and discerning intent flexibly from various command phrasings.
  • Voice Identification: Tracking individual voices for personalization and authentication.
  • Sound Event Detection: Detecting and recognizing sounds (like breaking glass) for security and maintenance applications.
  • Anomaly Detection: Spotting anomalies and predicting mechanical failures before they happen, thereby supporting predictive maintenance.
  • Gesture Recognition: Providing UI/UX capabilities for things like game controllers and wearables.
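To give a flavor of what “always-on” looks like in practice, here’s a hypothetical sketch of a wakeword loop. None of these names are Femtosense APIs; the frame size, the threshold, and the wakeword_score() stand-in (a crude energy check in place of the sparse network running on the NPU) are all of my own devising:

```python
# Hypothetical always-on wakeword loop (illustrative only, not the real SDK).
import numpy as np

FRAME_LEN = 320      # 20 ms of 16 kHz audio per frame (assumed)
THRESHOLD = 0.8      # confidence above which we treat the wakeword as heard

def wakeword_score(frame: np.ndarray) -> float:
    """Stand-in for the network scoring one audio frame on the NPU."""
    return float(np.abs(frame).mean() > 1000)   # toy energy-based score

def run_always_on(frames):
    """Scan frames; wake the host only when the score crosses the threshold."""
    for i, frame in enumerate(frames):
        if wakeword_score(frame) > THRESHOLD:
            print(f"Wakeword detected at frame {i}: waking host processor")
            return i
    return None

# Toy input: 50 quiet frames, then one loud frame the stand-in scores high.
quiet = [np.zeros(FRAME_LEN, dtype=np.int16)] * 50
loud = [np.full(FRAME_LEN, 2000, dtype=np.int16)]
run_always_on(quiet + loud)
```

The point being that the power-hungry host processor only gets woken when the low-power, always-on part of the system has something worth waking it for.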

Take sentence-level understanding. Different people may express the same desire in different ways: “Turn the lights on,” “Turn on the lights,” “Lights on,” “Activate lighting,” “Let there be light,” and… the list goes on.

If you are manufacturing a voice-controlled product, you have a choice: Either you train your users to issue commands in a highly specific way (your users probably won’t thank you for this), or you create the product in such a way that it can understand and respond to commands presented in a variety of different ways (your users will very much appreciate your adopting this approach).
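Just to make the contrast concrete, here’s a toy illustration of my own devising (nothing like a real SLU model, which learns the mapping from many example phrasings) of a rigid command table versus a more flexible intent parser:

```python
# Rigid command table vs. a (very crude) stand-in for flexible intent parsing.

RIGID_COMMANDS = {"turn the lights on": "LIGHTS_ON"}   # users must say exactly this

INTENT_KEYWORDS = {
    "LIGHTS_ON": {"light", "lights", "lighting"},      # content words signaling intent
}

def rigid_parse(utterance: str):
    """Only recognizes the one blessed phrasing."""
    return RIGID_COMMANDS.get(utterance.lower().strip())

def flexible_parse(utterance: str):
    """Accepts any phrasing containing the right content words."""
    words = set(utterance.lower().replace(",", "").split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords and ("on" in words or "activate" in words or "be" in words):
            return intent
    return None

for phrase in ["Turn the lights on", "Lights on", "Activate lighting", "Let there be light"]:
    print(f"{phrase!r:25} rigid={rigid_parse(phrase)!s:9} flexible={flexible_parse(phrase)}")
```

The rigid parser recognizes only the first phrasing; the flexible one catches all four, which is the behavior your users actually want (a trained SLU model achieves this far more robustly than my keyword hack, of course).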

The small-size, low-power, high-performance, low-cost characteristics of AI MCUs like the ADAM-100 mean it won’t be long before voice control is embedded in everything, including all our household appliances, from dishwashers to thermostats to microwaves to electric kettles to electric toasters. Of course, just saying this made me want to bounce over to YouTube to see the Talkie Toaster scene from the legendary British science fiction comedy TV series Red Dwarf.

My conversation with Sam opened a whole new world to me with respect to the training of these AI models. Users can create, train, and deploy their own AI models on the ADAM-100. Or they can take pre-created (“canned”) models from Femtosense and train these models themselves. Or they can request a full custom solution in which the heroes and heroines at Femtosense do any model creation and training for them.

Now, this is where things get interesting. How do you actually go about training one of these models to understand a suite of spoken commands that may be issued by people with very different accents and speech patterns? More specifically, where do you obtain the data used for training?

One solution is to create a synthetic data set using sophisticated text-to-speech applications that support different accents and dialects. However, Sam says that the highest fidelity and performance are obtained by going out into the world and collecting real-world data from real-world people speaking the required command words and phrases.
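If you did want to dip a toe into the synthetic route, a minimal sketch might look something like the following. I’m assuming the gTTS text-to-speech package purely for illustration (any TTS engine with accent or voice options would do), with its top-level-domain parameter selecting different English accents:

```python
# Sketch: render each command phrase in several English accents via gTTS.
# (gTTS calls Google's online TTS service, so this needs a network connection.)
from gtts import gTTS

COMMANDS = ["turn the lights on", "lights on", "activate lighting"]
ACCENTS = {"us": "com", "uk": "co.uk", "au": "com.au", "in": "co.in"}  # TLD picks the accent

for phrase in COMMANDS:
    for label, tld in ACCENTS.items():
        clip = gTTS(text=phrase, lang="en", tld=tld)
        clip.save(f"{phrase.replace(' ', '_')}_{label}.mp3")   # e.g. lights_on_uk.mp3
```

As Sam notes, though, synthetic clips like these only take you so far; real-world recordings from real-world speakers are what deliver the best results.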

I wondered if the Femtosense folks do this themselves, but Sam tells me that this is one example of AI spawning whole new industries, because an entire business sector has sprung up to satisfy the requirements of this form of data collection.

I think we are poised to plunge into a new world of talking toasters (and that’s not something you expect to hear yourself saying every day). How about you? Do you have any thoughts you’d care to share on anything you’ve read here before any AI-enabled toasters and their friends start to express themselves in forums like this?
