Yes, You SHOULD Design for Failure
After designing spacecraft and then building financial systems, I found that many things I learned in the spacecraft world have been useful in the financial world.
One of them is Designing for Failure.
Wait, what?? You want me to design for FAILURE?
Yes.
Why? Because things fail, and if your design doesn’t take that into account, it’ll hurt – bad.
The aerospace industry pioneered ‘Failure Modes and Effects Analysis’ (FMEA). That, and its even wordier brother, ‘Failure Modes, Effects, and Criticality Analysis’ (FMECA), have been used for decades to analyze, design for, and recover from failures.
Not surprisingly, there are bulky military standards that govern what you do in that industry.
If you don’t need that bulk, skip it.
But you DO need to take time to design in what happens when stuff breaks (‘Failure Modes and Effects’). All you need is a block diagram of your system (I love pictures), a gathering of your team’s most flexible thinkers (plus coffee and chocolate chip cookies), and a way to capture the failure modes and responses (I love spreadsheets). It’s actually fun. Scary fun, but fun. Look at your system holistically, because a failure in one place can screw up a lot of other things. Kick back and think about what bad things could happen. Sprinkle in some discretion about each failure’s likelihood and impact.
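If a spreadsheet isn't handy, the same worksheet fits in a few lines of code. Here is a minimal sketch, assuming a simple 1-to-5 scale for likelihood and impact and a score that is just their product. The scale, the field names, and the numbers are all illustrative assumptions, not part of any formal standard.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    failure: str
    effect: str
    likelihood: int  # 1 (rare) to 5 (frequent); a team judgment call
    impact: int      # 1 (nuisance) to 5 (severe); a team judgment call
    response: str

    @property
    def score(self) -> int:
        # Simple risk score: likelihood times impact (an assumed convention).
        return self.likelihood * self.impact


def rank(modes):
    """Return failure modes sorted from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.score, reverse=True)


# Illustrative rows only; the ratings are made up for this sketch.
worksheet = [
    FailureMode("data provider feed", "times arrive in the wrong time zone",
                "data is no good", 2, 5, "compare to receipt time, alert"),
    FailureMode("inter-site network", "link flickers",
                "messages lost in transit", 3, 4, "detect gaps, request resend"),
]

for m in rank(worksheet):
    print(f"{m.score:>2}  {m.component}: {m.failure} -> {m.response}")
```

Sorting by score gives the team a rough order for deciding which failures deserve design attention first.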
Two real-life examples:
Failure Mode: A network between two sites flickers.
Effect: Messages are lost in transit.
“Whatcha gonna do?”
Failure Mode: A data provider suddenly starts sending transaction times in CST rather than EST.
Effect: Your data is no good.
“Whatcha gonna do?”
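For the flickering network, one common answer is to have the receiver notice the hole. Here is a minimal sketch, assuming the sender stamps every message with a sequence number that increases by one. That numbering scheme and the `GapDetector` name are assumptions for illustration, not something the original systems necessarily used.

```python
class GapDetector:
    """Tracks sequence numbers from one sender and reports missing ones."""

    def __init__(self, first_expected=1):
        self.expected = first_expected

    def receive(self, seq):
        """Return the sequence numbers skipped just before this message."""
        if seq < self.expected:
            # Duplicate or late arrival; nothing new is missing.
            return []
        missing = list(range(self.expected, seq))
        self.expected = seq + 1
        return missing


detector = GapDetector()
for seq in [1, 2, 5, 6]:
    missing = detector.receive(seq)
    if missing:
        # In a real system this might raise an alert and request a resend.
        print(f"ALERT: missing messages {missing}")
```

The receiver can't stop the network from flickering, but it can recognize the loss right away instead of finding out at the end of the day.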
So take the time to see where your system could be vulnerable when S#!T happens.
And design your system to recognize problems, alert you about them, correct what it can, and set you up for success to fix what it can’t.
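For the time zone example, "recognize and alert" could be as simple as comparing each reported transaction time to the time you received it. This is a sketch under stated assumptions: the provider sends times without a zone, they are supposed to be EST (treated here as a fixed UTC-5), and transactions normally arrive within a few minutes of happening. The function name and the five-minute tolerance are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), "EST")  # assumed contract with the provider
MAX_SKEW = timedelta(minutes=5)             # assumed normal delivery delay


def check_timestamp(reported_naive, received_at):
    """Interpret the reported time as EST and compare it to our receipt time."""
    reported = reported_naive.replace(tzinfo=EST)
    skew = received_at - reported
    if abs(skew) <= MAX_SKEW:
        return "ok"
    # A skew close to a whole number of hours smells like a time zone change.
    hours = round(skew.total_seconds() / 3600)
    if hours != 0 and abs(skew - timedelta(hours=hours)) <= MAX_SKEW:
        return f"suspect time zone shift of {hours:+d}h"
    return "suspect timestamp"


received = datetime(2024, 1, 15, 15, 0, 30, tzinfo=timezone.utc)
print(check_timestamp(datetime(2024, 1, 15, 10, 0, 0), received))  # sent in EST
print(check_timestamp(datetime(2024, 1, 15, 9, 0, 0), received))   # sent in CST
```

A check like this won't fix the provider's data, but it flags the problem on the first bad message and tells you what kind of problem to go look for.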
You’ll be glad you did… design for failure.