Do you trust your processor?
Yeah, you’re right; that’s not a fair question. If the question is reworded as, “Will your processor always give the correct result?” then the obvious comeback is, “Correct according to what?” If there’s a bug in the software, then the processor will give the correct – but not the desired – result.
So let’s assume good software. Now will the processor always give the correct – and desired – response?
Well, what if there’s a bug in the hardware? Of course, many of you reading this may well be deep in the throes of making sure that’s not going to be the case on your processor. As with software, it’s hard to guarantee that hardware has zero bugs. But, unlike software, great gobs of money and effort are expended on taking the bug count asymptotically close to zero.
So if we assume a good job has been done on verifying the processor, then can we (please) trust our processor?
Yes. Well… maybe. What are you doing with it? If you’re liking a dog picture on social media, then sure. But if your life depends on it? If it runs your car and is the main thing determining whether or not you become a roadside casualty, then… maybe not so much.
Even if the processor design team had truly discovered and resolved all issues, some of those issues aren’t binary. In particular, performance issues are verified to some level of confidence. It’s never 100%. Yeah, you can embrace 6σ, but what if some unlikely condition out at 7σ occurs?
Then there are uncontrollables like alpha particles. Or silicon wear-out, or walking-wounded issues that manifest later. Some of these may be temporary, some fatal.
So even the most tech-naïve of us knows that we can’t count 100% on simple technology all the time, and we make allowances when that web page doesn’t come up the first time or when our call drops.
Running the Critical Parts of the Car
There’s an old set of jokes about what it would be like if cars were run by Windows. Things like, every now and then having to pull over to the side of the road, shut the engine down, and restart it – for no particular reason. That’s all funny – until you realize that upcoming self-driving cars are going to be run by technology nominally of the same sort that occasionally serves up a blue screen (whether or not branded by Microsoft).
So if we can’t 100% guarantee outcomes for so-called safety-critical operations – circuits in planes and trains and automobiles and medical devices and nuclear power plants – then how can we trust that those circuits won’t be our undoing?
In the automotive world, the ISO 26262 standard lays out expectations for different sorts of functions according to how likely a hazardous situation is, how much control the driver retains, and what the consequences of failure would be. These are given ASIL ratings: A (of least concern) to D (stuff better work or people could die).
So, out at that ASIL-D level, what do you do?
This concern has long been a factor in the mil/aero industries, where planes need to stay aloft and munitions must not deviate from their trajectories. One of the solutions there is referred to as “triple-module redundancy” (TMR). The idea, oversimplified: triple up the computing at critical nodes. If one processor has an issue (low probability if designed well), the odds that another one has the same issue at the same time are lower still. So in the event that all three processors don’t agree, a two-out-of-three vote settles the argument. Democracy in action!
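The vote itself is simple enough to show in a few lines. Here’s a minimal C sketch of a bitwise two-out-of-three majority vote; the function names are mine, purely for illustration.

```c
#include <stdint.h>

/* Illustrative only: bitwise two-out-of-three vote over the results of
 * three redundant processors. Each bit of the output takes the value
 * that at least two of the three inputs agree on. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* A voter can also report which unit dissented (e.g., for logging).
 * Returns 0, 1, or 2 for the odd one out, -1 if all three agree,
 * and 3 if all three differ and there is no majority. */
static int tmr_dissenter(uint32_t a, uint32_t b, uint32_t c)
{
    if (a == b && b == c) return -1;
    if (a == b) return 2;
    if (a == c) return 1;
    if (b == c) return 0;
    return 3;
}
```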
This works – for a price. In that market, prices are indeed higher to support this extra cost burden (and many others). The same can’t be said, however, for the automotive market. Lives are still at stake, but shaving costs is critical. In this case, there’s a different way of handling processor failure. It still involves redundancy, but less than TMR.
The automotive approach is to use two instead of three processors. And, instead of three processors without hierarchy, the dual-core approach has a main processor and a shadow processor that acts as a double-check. Synopsys has announced a dual-core module targeting ASIL-D applications, referring to their instances in a circuit as “safety islands.”
(Image courtesy Synopsys)
The idea here is that the main core has primacy, but it’s got this shadow core looking over its shoulder. If the shadow doesn’t agree with a result that the main core produces, it alerts. What happens then depends on the application; think of it as throwing an exception, and the code has to determine the error handler. Except that, this being hardware, there are several options for manifesting a (hopefully) graceful exit from the state of grace.
When such a disagreement occurs, a two-bit error signal is set – and remains set until specifically reset. The state of the cores is also frozen for forensic or debug purposes. For recovery, you get three options: reset the core; interrupt the core; or send a message to a host processor. Synopsys sees the first two as most likely, since trust in the main core is now compromised (even though it’s theoretically possible that it could be the shadow core that glitched).
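To make those recovery options concrete, here’s a hedged C sketch of what handling a mismatch might look like from software. Every register name and address below is invented for illustration – this is not Synopsys’s actual interface, just the shape of the three choices the module offers.

```c
#include <stdint.h>

/* Hypothetical register map, invented for illustration only. */
#define SAFETY_ERR_STATUS (*(volatile uint32_t *)0x40001000u) /* sticky error code */
#define SAFETY_ERR_CLEAR  (*(volatile uint32_t *)0x40001004u) /* write 1 to clear  */
#define CORE_RESET_CTRL   (*(volatile uint32_t *)0x40001008u) /* write 1 to reset  */
#define HOST_MAILBOX      (*(volatile uint32_t *)0x4000100Cu) /* message to host   */

enum recovery_policy { RECOVER_RESET, RECOVER_INTERRUPT, RECOVER_NOTIFY_HOST };

enum recovery_policy choose_recovery(void);  /* application-defined choice */
void trigger_core_interrupt(void);           /* platform-specific, assumed */

void handle_lockstep_mismatch(void)
{
    uint32_t err = SAFETY_ERR_STATUS;  /* stays set until explicitly cleared */
    if (err == 0u)
        return;                        /* no mismatch recorded */

    /* The cores' state is frozen at this point; a debugger or logger
     * could dump it here before any recovery action disturbs it. */

    switch (choose_recovery()) {
    case RECOVER_RESET:
        CORE_RESET_CTRL = 1u;          /* trust in the main core is gone: restart it */
        break;
    case RECOVER_INTERRUPT:
        trigger_core_interrupt();      /* let an interrupt handler decide what's next */
        break;
    case RECOVER_NOTIFY_HOST:
        HOST_MAILBOX = err;            /* escalate to a supervising host processor */
        break;
    }

    SAFETY_ERR_CLEAR = 1u;             /* explicitly reset the sticky error signal */
}
```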
Simple in Principle, But…
So far, so good. But… what happens if some event occurs – a power glitch, an alpha particle, whatever – that affects both processors? As circuits get smaller, even localized events start to affect more circuitry at the same time. If that happens, the main core might generate an incorrect result – and the supervisor, still reeling from the same event, might go along with it. Not a good thing at 70 mph.
So the module employs a notion called “time diversity” – the shadow core does what the main core does, only one or two clock cycles later. (The specific number of cycles is programmable.) This makes it much less likely that something affecting the main core will affect the shadow core equally.
This is done with a FIFO in the safety monitor; the main core’s inputs and result are pushed into the FIFO so that they can be compared at a (slightly) later time with the shadow core’s outcome. The comparison is made every clock cycle.
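For the mechanics-minded, here’s a small behavioral model in C of that delayed comparison. The snapshot contents, the names, and the ring-buffer details are my assumptions – the real monitor is hardware – but the logic is the same: push the main core’s per-cycle snapshot, pop it a programmable number of cycles later, and compare it against what the shadow core just produced.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_DELAY 4u   /* buffer capacity; 'delay' must be < MAX_DELAY */

typedef struct {
    uint32_t inputs;   /* inputs that fed the core this cycle           */
    uint32_t result;   /* the compared subset of internal state/outputs */
} snapshot_t;

typedef struct {
    snapshot_t slot[MAX_DELAY];
    unsigned   head;   /* oldest entry                      */
    unsigned   count;  /* entries currently held            */
    unsigned   delay;  /* programmable: e.g., 1 or 2 cycles */
} monitor_t;

void monitor_init(monitor_t *m, unsigned delay)
{
    m->head  = 0;
    m->count = 0;
    m->delay = delay;
}

/* Called once per clock cycle; returns true on a lockstep mismatch. */
bool monitor_step(monitor_t *m, snapshot_t main_now, snapshot_t shadow_now)
{
    bool mismatch = false;

    /* Push the main core's snapshot; it will be compared 'delay' cycles
     * from now, when the shadow core catches up to this point. */
    m->slot[(m->head + m->count) % MAX_DELAY] = main_now;
    m->count++;

    /* Once the pipeline is primed, the shadow core's current output
     * should match what the main core produced 'delay' cycles ago. */
    if (m->count > m->delay) {
        snapshot_t delayed = m->slot[m->head];
        m->head = (m->head + 1u) % MAX_DELAY;
        m->count--;
        mismatch = (delayed.inputs != shadow_now.inputs) ||
                   (delayed.result != shadow_now.result);
    }
    return mismatch;
}
```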
Which raises a new question: what is a “result”? Some instructions take more than one cycle to complete; what’s the intermediate result? Some instructions perform a calculation, in which case there is a specific result. But others might store data into memory – what exactly is the result there? Do you then go test whether the data truly ended up in memory? Does the shadow core do a test-and-store if the to-be-stored values disagree?
There are a couple of pieces to the answer. First, you can’t have results with definitions that vary according to the application; that’s just crazy-making. Instead, some fixed subset of the internal state gets compared. That works for each clock cycle, regardless of the specific instruction.
The other piece is that the shadow core can read from memory, but it can’t write to it. It’s not there to “do” anything; it simply supervises, tattling when there’s an issue.
Synopsys says that dual-core processors aren’t a new thing, but most target higher-performance applications. They say that their ARC-based dual-core module – intended specifically for ASIL-D usage – is the first in the microcontroller range.
All of this effort so that, when you’re cruising down the coast, hair blowing all over, magical tunes blaring from your speakers, and your car doing all the work automatically, you won’t have to think about your processors. You’ll just trust them.
What do you think about Synopsys’s dual-core module?