Resilience And Reliability In A Digital World

Understanding software failures and building guardrails

The recent CrowdStrike outage is just one more manifestation of a challenging need: resilience and reliability in an increasingly digital world. The attraction of digital systems is their flexibility and adaptability. New software can be created to enhance functionality and improve security. The astonishing array of functions that can be implemented on a high-end smartphone attests to this development. But it is precisely the concentration of functions that programmable devices can offer that draws our attention to the need for resilience, reliability, and backup alternatives. The more we depend on these systems, the more important is our ability to increase reliability and to devise rapid recovery methods. Outages produce cascading failure scenarios when resilience and recovery are not adequately provided for in design and operation. More tools are needed to help anticipate the consequences of failures and to develop designs to prevent or mitigate them.

Too often, failures are self-inflicted. Sometimes the mistake is a misconfiguration. Other times, it’s an inappropriate variable setting or a blatant programming error, as in the CrowdStrike case. In addition to inventing better programming environments and formal specification tools, such as TLA+, that help programmers avoid errors, and better testing tools beyond fuzzing, we need operational discipline. For example, the rollout of new software should be gradual until confidence can be gained that it works in live operation. The notion of canary releases, in which new software goes first to a small part of the field to see how it behaves, is very germane. Of course, there must be ways to revert to previously operational versions if a release exhibits unanticipated failure modes. No amount of testing is ever as broad as live operation, especially in a system as large and complex as the Internet.
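To make the idea of a gradual rollout with a way back concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any vendor's release process: the function name `staged_rollout`, the stage fractions, and the error-rate threshold are all hypothetical, and the health check is passed in as a plain function rather than read from real telemetry.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float,
) -> Tuple[str, float]:
    """Widen a release one stage at a time and stop at the first unhealthy stage.

    stages         -- increasing fractions of the fleet, e.g. [0.01, 0.1, 0.5, 1.0]
    error_rate_at  -- hypothetical health probe: the error rate observed once
                      the release reaches the given fraction of the fleet
    max_error_rate -- highest error rate tolerated before reverting

    Returns ("completed", last_fraction) or ("rolled_back", failing_fraction).
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")

    for fraction in stages:
        observed = error_rate_at(fraction)
        if observed > max_error_rate:
            # Revert to the previously operational version instead of
            # exposing the rest of the fleet to the fault.
            return ("rolled_back", fraction)

    return ("completed", stages[-1])


# A release that only misbehaves once it reaches 10% of the fleet.
def faulty_release(fraction: float) -> float:
    return 0.2 if fraction >= 0.1 else 0.0


result = staged_rollout([0.01, 0.1, 0.5, 1.0], faulty_release, max_error_rate=0.01)
```

In this example the fault is caught at the second stage, so nine-tenths of the fleet never receives the faulty release. The design choice worth noting is that the decision to stop is made on evidence from live operation, which no pre-release test can fully reproduce.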

A related discipline is dependency analysis, in which the designer seeks to understand whether and where there may be single points of failure that could be eliminated. System design includes this kind of thinking precisely in aid of resilience and avoidance of vulnerability. Red Team testing is often effective, though preferably undertaken by parties other than the original programmers (who may unconsciously avoid exactly the failure modes that Red Team testing is intended to expose).
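A simple form of dependency analysis can be sketched in a few lines. The example below assumes the system is described as a directed graph in which an edge means "relies on", and it reports every component whose removal alone cuts the path from an entry point to a required resource. The component names and the function `single_points_of_failure` are hypothetical, and the brute-force search is chosen for clarity rather than efficiency.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence


def reachable(
    graph: Dict[str, Sequence[str]],
    start: str,
    goal: str,
    removed: Optional[str] = None,
) -> bool:
    """Breadth-first search from start to goal, pretending `removed` has failed."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(
    graph: Dict[str, Sequence[str]], start: str, goal: str
) -> List[str]:
    """List components whose individual failure disconnects start from goal."""
    if not reachable(graph, start, goal):
        raise ValueError("goal is not reachable even with every component working")
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return sorted(
        n for n in nodes - {start, goal} if not reachable(graph, start, goal, n)
    )


# Hypothetical service: two application servers, but one load balancer
# and one database in front of the storage layer.
system = {
    "client": ["lb"],
    "lb": ["app1", "app2"],
    "app1": ["db"],
    "app2": ["db"],
    "db": ["storage"],
}

weak_spots = single_points_of_failure(system, "client", "storage")
```

Here the redundant application servers do not appear in the result, while the load balancer and the database do, which is exactly the kind of finding a designer would want before an outage rather than after.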
