For years, when we've thought about “functional safety” or “safety-critical design,” we’ve pictured airplanes, spaceships, and weapons. All of these systems rely on tons of electronics, and they have to work properly or else bad things happen – either lots of time and money lost or lives lost.
And so, for years, the “mil/aero” world has been its own special thing. Some companies specialize in that business because margins can be good. Others stay away because design cycles are long and unpredictable, and there can be tons of paperwork – and who needs that, right?
So we’ve had a broad general electronics industry, with players vying to be the new killer app on the block and hoping for high volumes, and then there’s been this other corner of specialized design practices and lower volumes.
Well, that’s changing. You may recall in the past that, after your plane landed and was taxiing, the purser would come on and remind you that you had now completed the safest part of your journey and would then wish you a safe rest of the trip – the dangerous part involving you in a car on a highway. Yes, driving can be more dangerous than flying.
Why haven’t we, in this industry, been equally worried about cars? That’s because, historically, there’s been precious little significant electronics in cars. And, where it has grown over the last decade, it has largely taken the form of simplistic electronic control units – ECUs – running software on some simple processor. Although, to be sure, the hardware was important too.
We’ve also culturally accepted the fact that it’s perhaps expected, if not outright OK, that thousands will lose their lives in cars each year, but it’s not expected or acceptable for that to happen with airplanes. We blame drivers for car accidents, but airplane companies for plane crashes. So we hold aero companies’ feet to the fire while going nominally easier on car makers (at least until a widespread systematic design problem – or outright fraud – is identified).
Well, no more. With self-driving cars in the offing, the automotive industry is shoveling massive piles of electronics into future cars. And, with that whole autonomous thing, the driver will no longer be in control, so there’s liability – which changes the equation. So now we have the need for functional safety, but in a high-volume, cost-sensitive industry. That makes it a new beast. What worked in the past – like triple-modular redundancy – can no longer simply be extrapolated to cars.
Yes, microcontroller (MCU) makers are responding with targeted MCUs for cars, such as the MIPS announcement recently covered by my colleague Jim Turley. But that addresses the old software-on-MCUs model – a model that will still exist, only now with a bevy of purpose-built silicon alongside it.
An Additional Industry Has to Pay Attention
That’s because, while software will continue to be a critical aspect, the volumes can justify dedicated systems-on-chip (SoCs), meaning custom silicon the likes of which hasn’t been so common in cars until now. So an entire new industry has been dragged into the functional-safety scene: EDA. According to Mentor’s Rob Bates, EDA wasn’t really a party to the development of the automotive safety standard ISO 26262, which caught the tool makers somewhat “flat-footed.”
So when a sophisticated SoC comes around, both its hardware and the software running on it have to pass muster. Plus, there’s one other new aspect we’ll come back to that’s not nearly as clear.
No matter which industry you’re talking about, functional safety boils down to one basic question: What happens when something goes wrong?
There are three aspects to this (plus corollary questions):
- Systematic errors: these are errors in the design itself. The design process is supposed to eliminate these.
- Random transient errors: these would be like alpha particles that can flip a state, but whose effects can, in theory, be corrected.
- Random permanent errors: these would be something like, say, an oxide rupture. No way to correct that; when it’s done, it’s done.
The corollary questions are:
- How do you prove safety in the face of any of these?
- How can you calculate FIT (failures in time) rates to quantify the level of safety? (A rough sketch of the arithmetic follows below.)
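To give a flavor of the arithmetic behind that last question: a FIT is one failure per billion (10^9) device-hours, and a safety mechanism’s diagnostic coverage reduces how much of a block’s failure rate counts against you. Here’s a minimal sketch in Python; the block names, FIT values, coverage numbers, and the ASIL D budget are all illustrative assumptions, not figures from any real design or from the standard’s tables.

```python
# Rough FIT arithmetic for a hypothetical SoC; every number here is made up.
# FIT = failures per 10^9 device-hours.

BILLION_HOURS = 1e9

blocks = [
    # (name, base failure rate in FIT, diagnostic coverage of its safety mechanism)
    ("cpu_core",     50.0, 0.99),    # e.g., lockstep comparison
    ("sram",        120.0, 0.999),   # e.g., ECC
    ("interconnect", 20.0, 0.90),    # e.g., bus parity
]

total_fit = sum(fit for _, fit, _ in blocks)
# Residual FIT: the slice of each failure rate that the safety mechanism misses.
residual_fit = sum(fit * (1.0 - dc) for _, fit, dc in blocks)

print(f"Raw failure rate:      {total_fit:.1f} FIT")
print(f"Residual failure rate: {residual_fit:.2f} FIT")
print(f"Failures per hour:     {residual_fit / BILLION_HOURS:.2e}")

# A commonly cited ASIL D budget for random hardware failures is on the order
# of 10 FIT (10^-8 per hour) -- treat this threshold as an assumption, not a quote.
ASIL_D_BUDGET_FIT = 10.0
print("Within assumed ASIL D budget:", residual_fit <= ASIL_D_BUDGET_FIT)
```

The standard’s real metrics (SPFM, LFM, PMHF) involve considerably more bookkeeping than this, but the flavor is the same: quantify what can fail, credit what your safety mechanisms catch, and budget what remains.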
We’ve seen EDA companies like Cadence respond with certification kits that help with the specific tools themselves. Mentor’s Mentor Safe program is similarly oriented, providing tools certification data and best practices to help design teams through the process. But 26262 requires more than that: it requires that the design process be certified as well. How can you demonstrate that you’ve handled the possible glitches?
Well, this year at DAC, I had a number of conversations in the area of functional safety for EDA. And it wasn’t so much about proving tools; that’s for the cert kits. This was about tools that let you assess and even fix your design to eliminate systematic errors and handle random ones.
Austemper
Austemper is a two-year-old company that has created tools for addressing random faults – both transient (taking into account the number of cycles required to recover) and permanent. They launched the tools at this year’s DAC. While the suite includes analysis tools, which a number of EDA players provide, they also have a synthesis tool that would appear to be unique in the industry.
To start with, they run analysis using SafetyScope to identify circuits and blocks that need attention. This isn’t full-on fault analysis, so it runs reasonably quickly (a million gates in a couple of hours – roughly the same amount of time as running a logic-equivalence check (LEC)).
From there, their Annealer tool can synthesize coverage circuits to detect and correct errors. This can be done on an entire chip or on selected blocks. Example synthesized elements include parity, ECC, and replicated circuits. Once done, they have a RadioScope tool that proves, at a fine-grained level, that the new modified circuit is logically equivalent to the starting circuit – only now with extra safety circuits.
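To make those synthesized elements a bit more concrete, here’s a purely behavioral sketch (Python, word-level; this is my illustration of what parity and replication buy you, not anything Annealer emits): parity detects a single flipped bit, while a triplicated value with a majority vote masks it.

```python
# Behavioral illustration of two common safety mechanisms: parity (detect)
# and triplication with a majority voter (mask). Word-level Python stand-ins,
# not RTL, and certainly not Austemper's output.

def parity_bit(word: int) -> int:
    """Even parity over a 32-bit word."""
    return bin(word & 0xFFFFFFFF).count("1") & 1

def parity_ok(word: int, stored_parity: int) -> bool:
    """Detects (but cannot correct) an odd number of bit flips."""
    return parity_bit(word) == stored_parity

def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote, as a triplicated register plus voter would behave."""
    return (a & b) | (a & c) | (b & c)

# A single-event upset flips bit 7 of one copy of the value.
value = 0x12345678
stored_parity = parity_bit(value)
corrupted = value ^ (1 << 7)

print("Parity flags the flip:", not parity_ok(corrupted, stored_parity))          # True
print("Voter masks the flip: ", majority_vote(corrupted, value, value) == value)  # True
```

ECC goes one step further than parity, correcting as well as detecting, but the detect/mask distinction above is the essential trade-off among these mechanisms.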
Finally, the FIT rates can be quantified by running a tool called Kaleidoscope, which performs fault injection to prove that faults can be handled safely. They do this by obtaining a value-change dump (VCD) file from a “golden” RTL simulation. The results of fault injection are then compared against this golden result.
The tools are relatively fast; they can prove roughly 4000 faults in two hours. Three main factors give them this higher performance: parallel execution, working at the RTL level instead of the gate level, and carefully limiting the scope of the circuit simulation and time range by pruning and restricting work only to the cones of control and observation. It runs hierarchically and tracks results to avoid any overlap or duplication of effort.
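Conceptually, a fault-injection campaign like this boils down to re-running the design with one fault active and comparing what comes out against the golden trace. Here’s a toy sketch of that classification loop (Python; the “design” is a trivial stand-in, and the safe/detected/dangerous buckets are my shorthand rather than Austemper’s terminology):

```python
# Toy fault-injection "campaign" compared against a golden reference run.
# The design under test is a stand-in function; a real campaign replays RTL
# simulations and compares value-change traces instead.

def design(inputs, stuck_at=None):
    """Trivial 4-bit adder with an optional stuck-at-0 fault on one result bit."""
    a, b = inputs
    result = (a + b) & 0xF
    alarm = False
    if stuck_at is not None:
        faulty = result & ~(1 << stuck_at) & 0xF    # force that bit to 0
        alarm = ((faulty ^ result) & 0x7) != 0      # crude checker watches bits 0-2 only
        result = faulty
    return result, alarm

stimulus = [(3, 4), (9, 9), (15, 1), (5, 10)]
golden = [design(s)[0] for s in stimulus]           # golden reference outputs

for fault_bit in range(4):
    verdicts = []
    for s, ref in zip(stimulus, golden):
        out, alarm = design(s, stuck_at=fault_bit)
        if out == ref:
            verdicts.append("safe")        # fault never reached the output
        elif alarm:
            verdicts.append("detected")    # wrong output, but the checker fired
        else:
            verdicts.append("dangerous")   # wrong output and no warning
    print(f"stuck-at-0 on bit {fault_bit}: {verdicts}")
```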
OneSpin
Meanwhile, OneSpin has also newly addressed this space. Their approach to dealing with systematic errors harkens back to messaging they were using many years ago: gap-free verification. This gets to the notion that, given a set of design requirements, the design should behave in a way that meets all of the requirements and nothing more. Every element of the design should be necessary and sufficient. Any behavior that lies outside what the requirements specify becomes an issue. So, clearly, this is something that OneSpin has cut its teeth on.
For random errors, however, they have an approach different from – and potentially complementary to – Austemper’s. OneSpin’s Dave Kelf noted that, after simulation-based fault analysis, there typically remain on the order of a couple hundred uncertain faults that need to be checked manually. And real-world speed is such that one can address roughly one such fault per day. But, of course, OneSpin does everything using formal analysis rather than simulation, so this issue goes away.
OneSpin has three applications for handling random errors. FPA starts by pruning non-propagatable faults from future analysis. After all, if a fault occurs and it never gets to or affects an output, did it really happen? Truly navel-gazing stuff, but, from a practical standpoint, time need not be spent on such faults. You could say that such faults are self-handling.
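The pruning idea is easy to picture structurally: if there’s no path from the fault site to anything observable, the fault can never matter. Here’s a small sketch of that kind of reachability check over a made-up netlist (Python; the node names and connectivity are invented, and real tools go well beyond simple graph traversal):

```python
# Structural "can this fault ever be observed?" check over a toy netlist.
# Node names and connectivity are invented; real tools combine this kind of
# structural pruning with formal proofs, but unreachable faults are the easy win.

from collections import deque

# net -> list of nets it drives (fan-out)
netlist = {
    "fault_a": ["and1"],
    "and1":    ["xor1"],
    "xor1":    ["out_data"],     # reaches an observable output
    "fault_b": ["dbg_mux"],
    "dbg_mux": ["unused_pin"],   # dead-ends inside the chip
}
observable = {"out_data", "out_status"}

def can_propagate(fault_site: str) -> bool:
    """Breadth-first search from the fault site toward any observable output."""
    seen, frontier = {fault_site}, deque([fault_site])
    while frontier:
        node = frontier.popleft()
        if node in observable:
            return True
        for nxt in netlist.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

for site in ("fault_a", "fault_b"):
    verdict = "keep for analysis" if can_propagate(site) else "prune: cannot be observed"
    print(site, "->", verdict)
```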
FLA then looks at fault-handling circuits to prove that they work. OneSpin had to add some functionality to their tools to make this second tool work – something that might sound trivial: force and release. Those are, in fact, trivial to do in a simulator, because they’re event-based commands that blend well with a simulation mindset. But they’re not so obvious with formal verification – and yet they were necessary for proving that an injected fault can be handled.
Finally, they have FDA, which quantifies fault coverage. It still requires some time to run – weeks for a large-scale design – but there’s no need to generate scenarios or vectors, as is needed with simulation. And there’s none of the uncertainty and dispositioning that are required for simulated faults, saving literally hundreds of engineer-days.
There’s even some talk with Austemper to see whether a formal engine might be more effective than simulation for Austemper’s Kaleidoscope tool. This is the “complementary” bit that I referred to. It’s not certain whether this will happen, but it shows how different solutions may overlap in constructive ways.
What About Machine Learning?
So we’ve looked at the basic hardware and software issues, but there’s an entirely new beast barreling into town that’s a third way: machine learning. Traditional design involves creating algorithms, testing them, and then implementing them in hardware and/or software. Once done, the algorithm is fixed and can be thoroughly vetted.
But with machine learning, depending on how it’s done, a system’s “design” isn’t complete until it’s learned its stuff.
There are two ways to learn, broadly speaking. With supervised learning, you get a sample training set where you know which ones are which. If you’re classifying animals as aquatic or terrestrial, for instance, each sample has the right answer for training purposes. With unsupervised learning, you don’t get the answers, so the system has to have some way of guessing based on the current state of a model and then checking the guesses to decide whether it needs to change its model.
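If it helps to see that difference in miniature, here’s a toy sketch (Python with numpy; the feature, data, and models are invented and wildly simplified): the supervised version fits directly against the known labels, while the unsupervised one has to invent a grouping and then refine it.

```python
import numpy as np

rng = np.random.default_rng(0)
# One invented feature per animal, e.g., "fraction of time spent in water".
aquatic     = rng.normal(0.8, 0.05, 20)
terrestrial = rng.normal(0.1, 0.05, 20)
x = np.concatenate([aquatic, terrestrial])

# Supervised: the labels come with the samples, so fitting is direct.
labels = np.concatenate([np.ones(20), np.zeros(20)])      # 1 = aquatic
threshold = (x[labels == 1].mean() + x[labels == 0].mean()) / 2
print("supervised threshold:", round(float(threshold), 3))

# Unsupervised: no labels, so guess a grouping and keep refining it (crude 1-D k-means).
centers = np.array([x.min(), x.max()])                    # initial guess at two groups
for _ in range(10):
    assign = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([x[assign == k].mean() for k in (0, 1)])
print("unsupervised cluster centers:", np.round(centers, 3))
```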
If you’re not surrounded by this stuff, however, there’s an easy mistake to make – one which I started to make. And that is to think that a chip with a convolutional neural net (CNN), for example, trains through the neural net – that is, that each chip can be trained. And that’s not how it works, as Synopsys and Cadence helped remind me.
Training takes a lot of computing power – more than you’d want on your chip, especially for something that would happen once. Instead, there are tools like TensorFlow, typically run in the cloud or on a server farm. You use those to train the CNN – which means determining the weights or coefficients that will be applied at each stage of the CNN. Those are then “uploaded” into each chip. So the actual silicon that ships will have only the finished, trained network.
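In miniature, that split – heavy lifting offline, frozen weights shipped to the device – looks something like the following sketch (Python with numpy; a toy classifier stands in for a real CNN, and the “upload” is just an array written to a file):

```python
import numpy as np

# ---- "Cloud" side: train a toy classifier, then export the weights ----
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)          # learn logical AND
w, b = np.zeros(2), 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
for _ in range(2000):                            # plain gradient descent
    p = sigmoid(X @ w + b)
    grad = p - y
    w -= 0.5 * X.T @ grad
    b -= 0.5 * grad.sum()

np.savez("trained_weights.npz", w=w, b=b)        # the artifact that gets "uploaded"

# ---- "Chip" side: inference only; load the frozen weights, never retrain ----
frozen = np.load("trained_weights.npz")
def infer(sample):
    return float(sigmoid(np.asarray(sample, float) @ frozen["w"] + frozen["b"]))

print([round(infer(s), 2) for s in X])           # trends toward [0, 0, 0, 1]
```

The point is the asymmetry: the expensive, iterative part never runs on the deployed silicon; the chip only ever executes the frozen result.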
There’s a thought that unsupervised training wouldn’t pass muster with functional safety – but that thinking applies only if you have training going on in the system and continuing beyond deployment. That’s not the case, at least today, with automotive-oriented designs, so, really, the mode of training doesn’t matter. What matters is that you have a model that’s going to ship on the silicon along with the hardware and associated software.
That would suggest that someone needs to validate not just the design (hardware and software), but the learned model as well. And if the model is updated, then that new model would also need to be certified. Of course, in any field update of such a system, if any of the software changes (or even hardware, if programmable devices are involved), then the new versions would also need to be certified before applying the update. So recertifying before upgrading isn’t really new; it’s just that, now, we have something yet more to certify: the learned model.
I’ve tried to poke around to see how 26262 handles machine learning, and, based on little incomplete bits of information I’ve seen, my tentative conclusion is that it doesn’t – not in the current version, nor in the update expected in 2018. But it is an active conversation in the community. This is probably worth a separate discussion, so I’ll defer this to the future. But, while the eventual 26262 approach may be in question, the existence of machine learning as a challenge isn’t.
On a final, higher-level note, Mentor’s Mr. Bates noted that the aircraft safety record is largely due to a central US federal authority – the FAA – that analyzes every crash and applies learning to future designs. This is why crashes have become so infrequent. He suggests that such an organization might materialize for cars as well, once the cars are in charge, in order to assure that new learning and best practices accrue to all cars.
The comment about the FAA is valid, but limited. Other countries around the world have their own investigation agencies. There is already an equivalent authority for vehicles in the US: the National Highway Traffic Safety Administration (NHTSA) has the same role as the FAA. They do collect and analyse data, but the big difference is in the way an accident is treated. The first task at an aircraft crash site is to secure the site and begin to collect information; the first task at a car accident is frequently to clear the road and keep the traffic flowing. Until this changes, the work of NHTSA and their peers around the world is going to be hampered. One way to improve the data available would be the mandatory fitting to cars of the equivalent of the “black box” data recorder.
Just a final thought. In the US, about 30% of road deaths (over 10,000 a year) involve alcohol – the third-highest share in the world, beaten only by Canada and South Africa.
Dick – points taken. I certainly wasn’t suggesting that a US agency would have dominion over everyone. My sense from the NHTSA is that the bar for enforcing change is higher than it is for the FAA. One plane crash can result in plane changes – even grounding of planes for inspections of some new fault. It seems to take a lot more car crashes before action is taken.