When Products Fail

When products fail, we start to wonder

Spectacular failures tend to attract a lot of attention. We are simply curious about why they happened and why nothing was done to prevent them. One such event was the engine fire on a Boeing 777 flying over Denver in February 2021, which was documented in a video shot from inside the plane. The front of the engine is shattered, the back of the engine is on fire, and later reports showed parts of the engine scattered over somebody’s backyard. Early reports pointed to metal fatigue in one of the engine’s blades, which caused it to break off inside the engine, as the cause of the fire.

Designing for the one-off failure

There are two key questions to ask: how was it possible for this to happen, and how was it possible for the plane to land safely? I asked a friend who is an experienced jet engine designer. He said the answer is simple: engineers did predict this type of failure, did study the possible consequences, and did design the rest of the plane so that it could respond appropriately, and the pilots were trained on how to react. He knows that because the plane landed safely and nobody was injured. In other words, the system worked as designed.

He then explained that a failure like this is always considered in two ways: how to prevent the part from failing, and how the rest of the design needs to respond if it does fail. He said one needs to think of a system instead of a product and follow these steps:

The structural integrity and lifespan of the part (the blade) are studied using simulations and physical tests to make sure that failures cannot occur within 3 sigma of the operational space.
The structure surrounding the part (the engine) is purposefully designed to carry the physical and functional load of the part in case it fails.
The system within which this structure exists (the plane) is designed to function in case of a total failure of that structure.
Planning for human intervention (depot maintenance procedures and pilot training) is made part of each design decision.
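
To get a feel for why the layers matter, here is a minimal sketch of how the odds stack up. It assumes the load on the part is normally distributed (so "3 sigma" maps to a one-sided tail probability) and that each later layer can be described by a conditional chance of failing once the previous one has failed. The function names and all of the layer numbers except the 3-sigma tail are made up for illustration; they are not real engine or aircraft figures.

```python
import math


def tail_beyond_sigma(k: float) -> float:
    """One-sided probability that a normally distributed value exceeds k sigma."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def chance_of_catastrophe(layer_failure_probs):
    """Chain the conditional failure probabilities of successive layers."""
    p = 1.0
    for q in layer_failure_probs:
        p *= q
    return p


# Hypothetical layers, each conditional on the previous layer having failed.
layers = [
    tail_beyond_sigma(3.0),  # part: conditions fall outside its 3-sigma envelope
    0.01,                    # engine does not carry the load (illustrative only)
    0.01,                    # plane cannot function without the engine (illustrative only)
    0.1,                     # maintenance and crew do not recover (illustrative only)
]

p_part = layers[0]
p_total = chance_of_catastrophe(layers)
print(f"part alone: {p_part:.2e}, all layers together: {p_total:.2e}")
```

The point is not the specific numbers but the shape of the result: a part that fails roughly once in a thousand exposures becomes a far rarer catastrophe once the engine, the plane, and the people around it are each designed to absorb the failure.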

He also pointed out that this entire decision process is based on the Bayesian probability of a failure (look it up, I had to…) because one cannot crash a plane to find out what that probability really is. He added that the safety factors and margins are huge by design; for example, the Boeing 777 is designed to keep flying for half of its scheduled journey even if one of its two engines fails, since that half is the furthest possible distance to the closest airport.
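
Since I had to look it up, here is the simplest version of the idea I could find, as a sketch rather than anything an engine maker actually uses. It assumes a Beta prior over the failure probability and updates it with test outcomes; the function name, the uniform prior, and the test counts are my own illustrative choices.

```python
def posterior_failure_rate(prior_failures: float, prior_successes: float,
                           failures: int, trials: int) -> float:
    """Posterior mean of a failure probability under a Beta prior.

    The prior is Beta(prior_failures, prior_successes); after observing
    `failures` out of `trials`, the posterior is
    Beta(prior_failures + failures, prior_successes + trials - failures).
    """
    a = prior_failures + failures
    b = prior_successes + (trials - failures)
    return a / (a + b)


# Hypothetical case: a uniform prior, Beta(1, 1), and 998 tests with no failure.
no_failures = posterior_failure_rate(1, 1, failures=0, trials=998)

# Same tests, but one failure was observed.
one_failure = posterior_failure_rate(1, 1, failures=1, trials=998)

print(no_failures, one_failure)
```

Even with zero observed failures the estimate is small but never zero, which is exactly what you want when the event in question is one you cannot afford to observe.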

So, while some failures may look like spectacular failures of engineering, taking a system approach to the design turns that interpretation on its head: that engine failure was in fact a great success of system design. What apparently fell short was the frequency of blade inspections, which was not sufficient to compensate for the shortcomings of the predictive simulations used during the design of the part and of the Bayesian probability approach.

Designing for the certain failure

There are situations in which the product is expected to encounter a catastrophic event. This came up during a discussion I had about fighter jets and the threat of heat-seeking missiles. The plane cannot outrun the missile, and the engine cannot be shut off to hide its heat. And even if the missile does not destroy the plane at the instant of the explosion, it will likely cause catastrophic damage to the engine and to the plane’s hydraulic and electronic controls. It is a no-win situation. But is it?

Apparently, this was debated without resolution until somebody stepped back and looked at the problem as a system of systems: a missile, an engine, and the interaction between the two. If something could be done to neutralize the negative effects of the explosion, then it would not matter if the explosion occurred. And there are only two such negative effects: explosion shock wave and missile debris.

The result was a conceptual design of a jet engine whose hypersonic thrust was sufficient to eject the debris and the shock wave before they could reach the internal parts of the engine. Brilliant! I cannot tell how far this concept has been taken in practice, but I do not recall many US fighter jets being lost in combat in a long time. Again, a system approach to design transformed the certainty of a failure into a success.

Conclusion

Not all product failures are avoidable, but most are predictable, and Systems Thinking has proven to be very effective at identifying them and mitigating their impact. The key is to start with a system-of-systems model that accounts for the emergent behaviors between the systems (with a known or a Bayesian probability) and to make that model the start of a digital thread. That thread is key to tracing failure patterns against the history of the design (requirements, simulations, implementation domains, changes, etc.) and to correcting the related assumptions in the system definition or in anything else the thread connects. It is also critical for relating input from the field (e.g., IoT data) to the proper digital twin of a serialized asset, one that accounts for all maintenance and modification activities since the asset left manufacturing.
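
To make the last point a little more concrete, here is a toy sketch of what "relating field input to the proper digital twin" could look like as a data structure. Every class, field, and value in it is hypothetical and chosen only for illustration; a real digital thread spans many tools and far richer records.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalTwin:
    """Record for one serialized asset (all fields are illustrative)."""
    serial_number: str
    design_revision: str
    maintenance_log: list = field(default_factory=list)
    field_readings: list = field(default_factory=list)


class DigitalThread:
    """Routes maintenance events and field data to the right twin by serial number."""

    def __init__(self):
        self._twins = {}

    def register(self, twin: DigitalTwin) -> None:
        self._twins[twin.serial_number] = twin

    def record_maintenance(self, serial_number: str, note: str) -> None:
        self._twins[serial_number].maintenance_log.append(note)

    def ingest_reading(self, serial_number: str, reading: dict) -> None:
        self._twins[serial_number].field_readings.append(reading)

    def trace(self, serial_number: str) -> DigitalTwin:
        """Return everything known about one asset, from design revision to field data."""
        return self._twins[serial_number]


# Hypothetical usage: two assets built to the same design, with different histories.
thread = DigitalThread()
thread.register(DigitalTwin("SN-001", design_revision="B"))
thread.register(DigitalTwin("SN-002", design_revision="B"))
thread.record_maintenance("SN-001", "blade inspection")
thread.ingest_reading("SN-001", {"vibration_mm_s": 4.2})

asset = thread.trace("SN-001")
print(asset.design_revision, asset.maintenance_log, asset.field_readings)
```

Keying everything on the serial number is the design choice that matters here: a reading from the field only becomes useful once it lands next to the design revision and the maintenance history of the exact asset that produced it.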