The Black Swan Fallacy: Why a failure of imagination is irrelevant to Resilience Planning
Every time there is a major incident, whether it be a global pandemic or a natural disaster, an IT outage or a bout of unseasonably hot or cold weather, the rallying cry of those trying to defend the paucity of their response to the unfolding events is now a cliché:

“We never thought things would ever get this bad, please bear with us”

Things have got so ridiculous that marketing blogs are helpfully suggesting different ways companies can restyle the phrase “Unprecedented Times”. I particularly like the alternatives “Cray times” and “Hot mess times”, both of which fall into the original and casual section of the UT pie chart. Don’t go for challenging, it’s far too overused and formal! Anyway, whichever words you use to describe these 12 Monkeys-like times, it shouldn’t matter one jot how Nintendo-hard an event was to comprehend when it comes to your response. Here’s why…

The Black Swan

If, like me, you have been getting into the Roman-era satire of Juvenal during these bizarre times, you will know it was long believed that black swans did not exist.

“Do you say no worthy wife is to be found among all these crowds?” Well, let her be handsome, charming, rich and fertile; let her have ancient ancestors ranged about her halls; let her be more chaste than the dishevelled Sabine maidens who stopped the war—a prodigy as rare upon the earth as a black swan! yet who could endure a wife that possessed all perfections? I would rather have a Venusian wench for my wife than you, O Cornelia, mother of the Gracchi, if, with all your virtues, you bring me a haughty brow, and reckon up Triumphs as part of your marriage portion.”

Europeans then hopped into their boats and began exploring the world. In 1697, these not so ancient mariners (I read modern poets too) came across the elusive black swans in Western Australia. So the Black Swan did exist, and the metaphor was lost to the annals of history. Not quite! The metaphor metamorphosed to mean that something dismissed as impossible may simply be an event nobody had imagined at the time.

Cue Taleb!

So how did the term Black Swan enter into the lexicon of risk management and business continuity planning? Well, Nassim Nicholas Taleb wrote a book – The Black Swan. Taleb describes Black Swans as having three attributes:

  1. They are outliers.
  2. They carry extreme ‘impact’.
  3. In spite of their outlier status, human nature makes us concoct explanations for their occurrence after the fact, making them seem explainable and predictable in hindsight.

The book is well written and uses stories to explain complex risk management theory. Taleb mainly focuses on economic risk, but the lessons are applied in many other areas of risk, in particular the field of operational resilience. I think Taleb nailed it on what a Black Swan is and on human ineptness in attempting to predict the future. In my humble opinion, though, he fell short on explaining how to deal with Black Swan events, which is by turning the problem of predicting such events on its head!

The definition of insanity…

Pre-pandemic business continuity tests followed a pretty standard format: trawl the Internet for a recent disruptive event, change the details slightly and then test the organisation’s plans against that scenario, usually at a desktop level only. Some organisations would go a little further and invoke a day at their Work Area Recovery site. Many would simulate a controlled failover of their data centre from PROD to DR for a couple of hours and then fail back to PROD. Then a disaster happens and the same excuses get trotted out: “We never thought that situation would have a knock-on effect on our business” or “We never thought about that bit” or “We didn’t want to test that bit because, had it broken, we weren’t sure if we could recover it before it impacted the business”. The saying goes (it is often attributed to Einstein, though there is no good evidence he ever said it) that the definition of insanity is doing the same thing over and over again and expecting different results. Yet that’s what organisations around the world do with their operational resilience programmes. It’s simply bonkers! But what should be done instead?

Focus on chaos

What if we turned the question around? Instead of scenario testing what the business would do in the event of an alien invasion coinciding with a comet hitting the sea and causing a mega-tsunami, what if we just:

Test until the impact of losing X or Y becomes unacceptable – then build in the necessary resilience to never have an unacceptable loss occur.

By flipping the question around, we are now in a position to test for each and every scenario possible. There will never again be taxing times (I’m still working through those alternatives). We simply take X or Y (or possibly X and Y) out of the picture and carry on until someone shouts stop! This idea is not even particularly new. It’s a key concept within chaos engineering, which is the facilitation of experiments to uncover systemic weaknesses. The experimental process has six steps:

  1. Define what acceptable should look like.
  2. Hypothesise that acceptable will continue in both a control group and an experimental group.
  3. Introduce variables into the experimental group that reflect real-world events, such as servers crashing, key people off sick, buildings no longer available or suppliers going out of business.
  4. Attempt to disprove the hypothesis (stated in step 2) by identifying the delta between the control and experimental groups, i.e. when a server crashed, did acceptable continue in both groups?
  5. Where a delta does exist between the two groups, close the delta.
  6. Start the process all over again.
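
To make the loop concrete, here is a minimal sketch of the six steps in Python. Everything in it is an assumption made for illustration: the Service class, the server names and the rule that a service is "acceptable" while it keeps at least min_healthy of its resources. A real programme would replace that rule with its own definition of acceptable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical model of something the business must keep running."""
    name: str
    resources: frozenset  # things the service depends on (made-up names)
    min_healthy: int      # step 1: fewer healthy resources than this is unacceptable


def failing_services(services, lost):
    """Return the names of services that drop below their acceptable level."""
    lost = frozenset(lost)
    return [s.name for s in services if len(s.resources - lost) < s.min_healthy]


def run_experiment(services, injected_loss):
    """Steps 2 to 4: compare a control group (no loss) with an experimental group."""
    control = failing_services(services, ())
    experiment = failing_services(services, injected_loss)
    # The delta is whatever fails only once the variable has been introduced.
    delta = sorted(set(experiment) - set(control))
    return {"hypothesis_holds": not delta, "delta": delta}


services = [
    Service("payments", frozenset({"server-a", "server-b"}), 1),
    Service("payroll", frozenset({"server-c"}), 1),
]

# Step 3: introduce a variable (server-c crashes) and look for a delta.
first_run = run_experiment(services, {"server-c"})

# Step 5: close the delta by giving payroll a second server.
services[1] = Service("payroll", frozenset({"server-c", "server-d"}), 1)

# Step 6: start the process all over again.
second_run = run_experiment(services, {"server-c"})
```

In this toy model the first run reports a delta for payroll, and the second run reports none once the extra server is in place. The same shape works whether the injected loss is a server, a person, a building or a supplier.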

The harder it is to get to an unacceptable level in the experimental group, the more confidence the business can have that, whatever scenario it faces, it will cope. If a weakness is uncovered, we fix it. Until we do, we at least know that an unacceptable impact could absolutely happen, and why!

Bringing structure to chaos

So how do we introduce chaos into our resilience planning? Can chaos be structured? To a certain extent it can. We can use the all-hazards approach to resilience planning. The all-hazards approach is designed for exactly the type of resilience planning chaos engineering brings to the table. All-Hazards assumes any scenario will cause one or more of the following PLATS impacts:

  • Loss of one or more (P)eople
  • Loss of one or more (L)ocations
  • Loss of one or more (A)ssets
  • Loss of one or more (T)echnologies
  • Loss of one or more (S)uppliers / third parties

Utilising PLATS, we can now define unacceptable levels of impact for each loss type, up to and including total loss. We then just use PLATS as variables in our chaos engineering experiments. Pick an Asset. Pick a Supplier. Take them out of the picture. What happens? Nothing? Great! Pick some more. Change the combinations. When it breaks, build in resilience until it no longer breaks. Repeat ad infinitum until nothing breaks anymore… then keep doing it, because new things will break!
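
The "pick some, change the combinations" routine can be sketched as a simple search over combinations of PLATS items. This is only an illustration under stated assumptions: the inventory, the item names and the still_acceptable rule are all hypothetical stand-ins for a real exercise, where "acceptable" would be judged by actually running the business without those items.

```python
from itertools import combinations

# Hypothetical inventory, grouped by PLATS category.
INVENTORY = {
    "People": ["ops-lead", "dba"],
    "Locations": ["head-office"],
    "Assets": ["generator"],
    "Technologies": ["prod-db", "dr-db"],
    "Suppliers": ["cloud-provider"],
}


def still_acceptable(lost):
    """Made-up acceptance rule standing in for a real test of the business."""
    has_database = not {"prod-db", "dr-db"} <= lost
    can_run_database = not {"dba", "cloud-provider"} <= lost
    return has_database and can_run_database


def find_breaking_combinations(inventory, acceptable, max_losses):
    """Take out 1, 2, ... max_losses items at a time and record what breaks."""
    items = sorted(item for group in inventory.values() for item in group)
    breaking = []
    for size in range(1, max_losses + 1):
        for combo in combinations(items, size):
            lost = set(combo)
            # A superset of a known breaking combination tells us nothing new.
            if any(known <= lost for known in breaking):
                continue
            if not acceptable(lost):
                breaking.append(lost)
    return breaking


breaking = find_breaking_combinations(INVENTORY, still_acceptable, max_losses=2)
```

With this toy rule no single loss breaks anything, but two pairs do: losing both databases, and losing the DBA together with the cloud provider. Each entry in breaking is a place to build in resilience before re-running the search with a larger max_losses. The number of combinations grows quickly with the size of the inventory, so a real programme would probably sample combinations rather than enumerate them all.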

No more Black Swans! No more excuses!

Thank you, Mr Taleb, your job here is done. We need no longer discuss Black Swan events or our inability to predict the future. Business no longer needs to fail as a result of a lack of imagination. We no longer need to hear politicians or CEOs talk about a lack of preparation for such disturbing times (I’m still going). Just apply chaos engineering to the PLATS impacts and resilience will be baked in by design. If you need help implementing this in your business, get in touch. We are more than willing to introduce some chaos into your life 😉

