Some products may, if they fail in a certain way, endanger people’s safety or lead to enormous economic losses. That puts much more responsibility on the designers’ shoulders and is why using fail safe design principles is so critical. Let’s take a closer look at these principles and some fail safe examples you can see in today’s products…
Why fail safe design is needed to reduce product risks
Let’s take a car, for example. The brakes and the airbag have to work every single time.
An explanation such as “the sensor did not work, so the airbag did not deploy, sorry” is simply not acceptable. The design engineer must account for that risk and plan for ways to mitigate it (in this example, it may involve using multiple sensors so that an accident gets detected even if one of the sensors fails in its function).
One of the main approaches to reducing the risk associated with those products is to introduce fail safe features to the design. Basically, they are safety nets that prevent a failure from resulting in a highly hazardous situation.
Let’s explore fail safe design principles and some examples further…
1. A few preliminary fail safe concepts
An error (step 1) can lead to a failure (step 2) which can lead to a hazardous situation (step 3) which can lead to personal harm and/or disastrous environmental/economic loss (step 4).
When a new product is designed, the company needs to ask questions based on this logic.
If the product is relatively simple (for instance, simple electric circuitry), then making an inventory of possible errors and working out what each one can lead to may be the first step. This bottom-up logic is the one behind an FMEA (Failure Mode and Effects Analysis).
In all other cases, the company needs to take the analysis in the opposite direction:
- What types of hazardous situations, if any, can lead to high-impact issues (personal harm and/or disastrous environmental/economic loss)?
- How would those hazardous situations come up? What types of failures could trigger them?
- What types of errors can lead to those failures?
This top-down logic is the one behind an FTA (Fault Tree Analysis). The findings are often documented in a design FMEA (dFMEA), and the exercise pushes design engineers to come up with ways to reduce product risk in general.
In practice, it usually involves a number of design improvements that can generally be categorized as “fail safe” features.
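To make the backward questioning more concrete, here is a minimal sketch that walks from a harm back to the errors that can cause it. The cause map is made-up example data based on the airbag case, and the function name is my own. A real analysis would contain many more entries and would also rate severity and likelihood.

```python
# Hypothetical example data: each effect maps to the things that can cause it.
CAUSES = {
    "occupant injured in crash": ["airbag does not deploy"],
    "airbag does not deploy": ["crash not detected"],
    "crash not detected": ["sensor fault", "wiring defect"],
}


def root_causes(effect, causes=CAUSES):
    """Walk backward from an effect and return the items that have no listed cause."""
    direct = causes.get(effect)
    if not direct:
        # Nothing upstream is recorded, so this is treated as an initial error.
        return [effect]
    found = []
    for cause in direct:
        for root in root_causes(cause, causes):
            if root not in found:
                found.append(root)
    return found


print(root_causes("occupant injured in crash"))
# ['sensor fault', 'wiring defect']
```

Each root cause found this way is a candidate for a preventive or fail safe feature.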
2. Whenever possible, prevent errors and failures in the product
How can you make sure the first steps on the way to serious issues don’t take place at all? Preventing them is the best approach wherever possible.
More complex products are more likely to fail
Can you simplify your product? Adding such and such features may delight customers… until the product fails.
High-end German cars may come with smart ways to wipe the windshield in a near-perfect manner, but it also means their owners need to go back to the garage more often to fix those advanced systems. (I am probably over-generalizing, but there is some truth to that.)
In contrast, Honda and Toyota have generally been great at avoiding such complexity, and it gave them an edge in reliability for many years. That is no longer the case, but they still benefit from a “very reliable” reputation.
Consistent manufacturing quality has a big role to play
Poor preventive maintenance of the equipment, poor training of the operators and line leaders, poor process controls during manufacturing, and poor choice of component suppliers all lead to a lack of consistency in the final product.
For example, an electronic product may stop functioning because a poor solder joint breaks under vibration. Or, worse, there may be a short circuit in 1% of the units, and it can send high voltage to the wrong components and start a fire.
As I wrote before, poor manufacturing quality can cause safety issues. It’s all related.
And I mentioned before how a pFMEA and a control plan are great tools to use here.
In some cases, the risk of an error is still too great, and a good answer is to add an inspector. (That’s common in car assembly factories, at the station where airbags are assembled into the vehicle. That’s a regulatory requirement in some countries.) But the best approach is generally to improve the system so the error can’t occur.
Adding redundancy to make the product more tolerant of failures
As I mentioned in the opening of this article, with multiple sensors detecting an accident, the airbags will deploy even if one sensor fails to do its job.
Redundancy is a key word on airplanes, since risk reduction is so important in that context. Here are a few examples:
- If the pilots attempt to take off without the flaps extended (leading to a serious hazard), two distinct alarm systems are activated — a visual signal + a sound alarm.
- Most airliners have at least two engines. If one engine flames out (failure), the remaining engine or engines are sufficient to keep the airplane flying and to land it.
- If all engines fail, the power to run the cabin instruments etc. is provided by a battery. That battery allows the pilots to attempt to restart the engines and/or to start the auxiliary power unit (which runs on fuel).
The principle here is simple. One system may fail, and it won’t have a dramatic impact. The probability of two independent systems failing on the same flight (and letting a hazardous situation come up) is very low.
Now, independent systems may still all fail at the same time. Sometimes they do. For example, if both engines stop functioning and the battery is not in the expected condition, as happened on Garuda Indonesia Flight 421, the pilots may not have sufficient power to follow standard operating procedures and the plane may crash. Making this impossible is usually beside the point. Making it extremely unlikely is the right approach.
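The arithmetic behind redundancy is simple enough to sketch. If the systems really are independent, the probability that all of them fail is the product of their individual failure probabilities. The function name and the 1-in-10,000 figure below are illustrative assumptions, not real aviation data.

```python
def all_fail_probability(failure_probabilities):
    """Probability that every system fails, assuming the failures are independent."""
    result = 1.0
    for p in failure_probabilities:
        result *= p
    return result


# Illustrative figure only: each system fails on 1 flight in 10,000.
single = 1e-4
both = all_fail_probability([single, single])
print(f"one system: {single:.0e}, both systems: {both:.0e}")
```

The independence assumption is the weak point. A common cause (the same storm, the same maintenance mistake) can take out several systems at once, and then the real probability is much higher than this product suggests.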
Prevent people from making mistakes
People cause errors by interacting with products. That’s a fact. Failing to plan for single-point human errors is sometimes a big design mistake.
Let’s take a couple of examples.
- I mentioned that manufacturing processes can cause defects that can then lead to failures. People have been very creative in error-proofing human operations. The book about poka yoke gives a number of examples.
- In a Virgin Galactic test flight that went horribly wrong, a pilot unlocked the feather mechanism too early, leading to an in-flight breakup of the vehicle. (The Guardian came up with a nice graphic about this.) The design engineers didn’t think it was a risk worth preventing and so didn’t make that action impossible. They got blamed for it.
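One way to error-proof a dangerous command is an interlock that refuses it outside a safe window, rather than relying on the operator’s timing. The sketch below is generic: the class name, the speed threshold, and the values are hypothetical, and it does not describe how any real vehicle is built.

```python
class UnlockInterlock:
    """Hypothetical interlock that ignores an unlock command outside the safe window."""

    def __init__(self, min_safe_speed):
        self.min_safe_speed = min_safe_speed
        self.locked = True

    def request_unlock(self, current_speed):
        # An unsafe command is refused outright, not just flagged with a warning.
        if current_speed < self.min_safe_speed:
            return False
        self.locked = False
        return True


# Illustrative values only.
lock = UnlockInterlock(min_safe_speed=1.4)
print(lock.request_unlock(current_speed=0.9), lock.locked)
print(lock.request_unlock(current_speed=1.5), lock.locked)
```

The design choice is that a single human error cannot, on its own, put the system in a hazardous state.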
3. If/when a failure occurs, make it less likely to trigger a hazardous situation (examples)
In case of a failure, can the product still prevent a hazardous situation? Here are a few examples:
“Keep doors open, not closed, in an emergency”
This is the classic example of a ‘fail safe’ feature. When the electricity goes off, electric door locks get unlocked to avoid trapping people inside in a potentially hazardous situation (a fire might be raging in or near the building, and people might get killed if they can’t escape).
Note: it’s about people’s safety, not personal property safety. A room full of servers, usually without people, presents a different risk profile, so the locks might have to behave differently. They might have to remain locked when power is off. That’s called fail secure, and a lot of consideration is needed to ensure people can still escape in hazardous situations.
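The difference between the two behaviors comes down to what the lock does when power is lost. Here is a minimal sketch, with names of my own choosing, that treats the power-loss behavior as an explicit design decision rather than an accident of the hardware.

```python
from enum import Enum


class PowerLossMode(Enum):
    FAIL_SAFE = "unlock when power is lost"
    FAIL_SECURE = "stay locked when power is lost"


def door_is_locked(power_on, commanded_locked, mode):
    """Return the lock state for a given power condition and power-loss mode."""
    if power_on:
        # With power available, the lock simply follows the command.
        return commanded_locked
    # Without power, only the chosen mode decides the outcome.
    return mode is PowerLossMode.FAIL_SECURE


print(door_is_locked(False, True, PowerLossMode.FAIL_SAFE))    # False
print(door_is_locked(False, True, PowerLossMode.FAIL_SECURE))  # True
```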
“Cut the power when the temperature gets too high”
In many electronic products, power is cut when the temperature gets abnormally high or another abnormal situation gets detected, and that’s becoming increasingly common with the proliferation of cheap sensors.
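In firmware terms, this kind of protection can be sketched as a latching cutoff. The class, the 85 °C limit, and the decision to treat a missing reading as unsafe are all my assumptions for illustration. Real products often combine this with a hardware thermal fuse.

```python
class ThermalCutoff:
    """Hypothetical latching over-temperature cutoff."""

    def __init__(self, max_temp_c):
        self.max_temp_c = max_temp_c
        self.tripped = False

    def power_allowed(self, temp_c):
        # A missing reading (None) is treated as unsafe: the sensor itself may have failed.
        if temp_c is None or temp_c > self.max_temp_c:
            self.tripped = True
        # Once tripped, power stays off even if the temperature drops again.
        return not self.tripped

    def reset(self):
        """Manual reset, meant to be used only after someone has checked the product."""
        self.tripped = False


cutoff = ThermalCutoff(max_temp_c=85.0)
print(cutoff.power_allowed(40.0))  # True
print(cutoff.power_allowed(90.0))  # False
print(cutoff.power_allowed(40.0))  # False, still latched
```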
“Move to a setting that can’t add to the problem”
On some propeller planes, if the mechanism that sets the angle of the blades stops functioning, the blades move to the position where they create the least drag (this is called feathering). Otherwise, the blades might create so much drag that controlling the plane’s trajectory would not be possible.
4. Notify people in case of a failure that may lead to a hazard, or in case of a hazard (examples)
The product could also notify users in case of a potentially hazardous situation occurring. Let’s look at a few examples.
“Making sure an alarm will ring when needed”
Many electrical circuits lead to activating an alarm. Think of a fire alarm. How do you make it more likely to get triggered when there is a fire?
The key is to design the circuit so that the alarm sounds by default whenever the circuit’s integrity is affected, even if that means an occasional false alarm. This way, a wiring fault such as a broken wire activates the alarm. This is usually preferable to the opposite situation, where a fault would prevent the alarm from ringing when there really is a fire!
This principle is nicely explained on this page.
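One common way to get this behavior, and the one assumed in this sketch, is a loop of normally-closed detector contacts wired in series. The alarm is held silent only while current flows through the loop, so a broken wire has the same effect as a tripped detector. The function names are mine.

```python
def loop_has_current(detector_contacts_closed, wiring_intact):
    """Current flows only if the wiring is intact and every contact is closed."""
    return wiring_intact and all(detector_contacts_closed)


def alarm_sounds(detector_contacts_closed, wiring_intact=True):
    """The alarm is silent only while current flows through the loop."""
    return not loop_has_current(detector_contacts_closed, wiring_intact)


print(alarm_sounds([True, True, True]))                       # False: all quiet
print(alarm_sounds([True, False, True]))                      # True: a detector tripped
print(alarm_sounds([True, True, True], wiring_intact=False))  # True: broken wire
```

The price is an occasional false alarm when the wiring is damaged, which is usually the better failure to have.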
“Making sure users can see when a gauge has failed”
When a measuring instrument is designed, can it be made to show some type of warning (or no information at all, rather than wrong information) in case it is malfunctioning? That’s a great fail safe design feature for instruments.
The worst that can happen is that a gauge stops functioning but still looks like it is functioning, inducing people to act in error. That was the case with a poorly designed NAV Receiver that led to a plane crash.
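Here is a minimal sketch of that idea for a digital readout. The display shows a placeholder instead of a number whenever the instrument cannot vouch for the value. The function name, the self-test flag, and the 2-second staleness limit are assumptions for illustration.

```python
def displayed_value(reading, reading_age_s, self_test_ok, max_age_s=2.0):
    """Return the text to display, or a placeholder if the value cannot be trusted."""
    if not self_test_ok or reading is None or reading_age_s > max_age_s:
        # Showing nothing useful is better than showing a stale or wrong number.
        return "---"
    return f"{reading:.1f}"


print(displayed_value(12.34, reading_age_s=0.5, self_test_ok=True))  # 12.3
print(displayed_value(12.34, reading_age_s=5.0, self_test_ok=True))  # ---
```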
P.S. Don’t forget to make it clear to users what their next steps are
Alerting users when they may be at risk is great, but giving them proper guidance is better:
- Should they leave the building by using the stairs?
- Should they stop their car at the next exit?
- Should they grab their emergency checklist?
If you always count on people to be properly trained and to keep a cool head, that may be a dangerous assumption…
A big limitation to fail safes
Sometimes there is a failure but something prevents the fail-safe mechanism from coming into play. Safety engineers need to consider this risk and think of how to mitigate it.
I saw a great comment Philip Koopman wrote on LinkedIn and I thought I should mention it here.
If you have redundancy, the hard problem then becomes knowing when to switch over to the backup path.
If you have an intermittent fault that might not trigger the failover, while still degrading functionality to the point that things aren’t working. Or a failed component that is supposed to fail silent instead fails by babbling nonsense. Redundancy is only a start. A robust failover management strategy also matters.
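To illustrate the kind of failover logic this comment calls for, here is a very simplified sketch. It judges a channel on its recent history, so that both an intermittent dropout (missed messages, shown as None) and a babbling component (implausible values) count against it. The function names, the plausibility range, and the tolerance of one bad sample are my assumptions, and real systems use far more elaborate strategies.

```python
def channel_healthy(samples, low, high, max_bad=1):
    """Judge a channel on recent samples; None means a missed message."""
    bad = sum(1 for s in samples if s is None or not (low <= s <= high))
    return bad <= max_bad


def select_channel(primary_samples, backup_samples, low, high):
    """Pick the channel to trust, or fall back to a safe state if neither is healthy."""
    if channel_healthy(primary_samples, low, high):
        return "primary"
    if channel_healthy(backup_samples, low, high):
        return "backup"
    return "safe_state"


good = [10, 10, 11, 10]
intermittent = [10, None, 11, None]  # keeps dropping out
babbling = [10, 9999, -500, 10]      # sends nonsense values

print(select_channel(good, good, low=0, high=100))          # primary
print(select_channel(intermittent, good, low=0, high=100))  # backup
print(select_channel(babbling, intermittent, low=0, high=100))  # safe_state
```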
Conclusion: Fail safe design is a must for the designers of many of today’s products
I think it is clear by now how much responsibility designers have in ensuring the safety of their products, be they medical devices, vehicles, or consumer electronics. The good news is, a structured analysis helps greatly, and getting inspiration from all types of fail-safe examples is invaluable.
Maybe you have a few tips or examples of fail safe design to add? Let me know in the comments or by contacting me.