The recent CrowdStrike outage is just one more manifestation of a challenging need: resilience and reliability in an increasingly digital world. The attraction of digital systems is their flexibility and adaptability. New software can be created to enhance functionality and improve security. The astonishing array of functions that can be implemented on a high-end smartphone attests to this development. But it is precisely the concentration of functions that programmable devices can offer that draws our attention to the need for resilience, reliability, and backup alternatives. The more we depend on these systems, the more important is our ability to increase reliability and to devise rapid recovery methods. Outages produce cascading failure scenarios when resilience and recovery are not adequately provided for in design and operation. More tools are needed to help anticipate the consequences of failures and to develop designs to prevent or mitigate them.
Too often, failures are self-inflicted. Sometimes the mistake is a misconfiguration. Other times, it’s an inappropriate variable setting or a blatant programming error, as in the CrowdStrike case. In addition to inventing better programming environments and specification tools that help programmers avoid errors, such as TLA+, and better testing tools beyond fuzzing, we need operational discipline. For example, the rollout of new software should be gradual until confidence can be gained that it works in live operation. The notion of canary releases, in which a small part of the field receives the software first so its behavior can be observed, is very germane. Of course, there must be ways to revert to previously operational versions if a release exhibits unanticipated failure modes. No amount of testing is ever as broad as live operation, especially in a system as large and complex as the Internet.
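The gradual rollout and revert discipline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `staged_rollout`, the stage fractions, the `error_rate_at` callback, and the 1% threshold are all hypothetical and do not describe any particular vendor's release process.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> Tuple[str, float]:
    """Widen a release stage by stage and revert at the first unhealthy stage.

    stages:          increasing fractions of the fleet, e.g. 0.01 for a canary
    error_rate_at:   hypothetical health probe; given the fraction currently
                     running the new version, returns the observed error rate
    max_error_rate:  assumed threshold above which the release is reverted

    Returns (outcome, fraction of the fleet left on the new version).
    """
    deployed = 0.0
    for fraction in stages:
        deployed = fraction  # expose the new version to this share of the fleet
        if error_rate_at(deployed) > max_error_rate:
            # Unanticipated failure mode: return everyone to the prior version.
            return ("rolled_back", 0.0)
    return ("completed", deployed)


if __name__ == "__main__":
    plan = [0.01, 0.10, 0.50, 1.00]
    # A healthy release proceeds through every stage.
    print(staged_rollout(plan, lambda fraction: 0.001))
    # A release that misbehaves at 10% exposure is reverted before it spreads.
    print(staged_rollout(plan, lambda fraction: 0.2 if fraction >= 0.10 else 0.0))
```

The point of the design is that a faulty release is caught while it affects only a small share of systems, and the revert path is exercised by the same procedure that performs the rollout.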

A related discipline is dependency analysis, in which the designer seeks to understand whether and where there may be single points of failure that could be eliminated. System design includes this kind of thinking precisely in aid of resilience and avoidance of vulnerability. Red Team testing is often effective, though preferably undertaken by parties other than the original programmers (who may unconsciously avoid exactly the failure modes that Red Team testing is intended to expose).
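One simple form of dependency analysis can be sketched as follows. The sketch assumes the system is modeled as a directed graph in which each component lists the components it calls, and it treats a component as a single point of failure if removing it leaves no path from a source to a target. The component names and the function names `can_reach` and `single_points_of_failure` are hypothetical, and real analyses would also have to consider shared power, networks, vendors, and software updates.

```python
from typing import Dict, List, Optional


def can_reach(
    graph: Dict[str, List[str]],
    source: str,
    target: str,
    removed: Optional[str] = None,
) -> bool:
    """Depth-first search from source to target, skipping the removed component."""
    seen = {source}
    stack = [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def single_points_of_failure(
    graph: Dict[str, List[str]], source: str, target: str
) -> List[str]:
    """Return intermediate components whose loss cuts every source-to-target path."""
    components = set(graph) | {dep for deps in graph.values() for dep in deps}
    candidates = components - {source, target}
    return sorted(
        c for c in candidates if not can_reach(graph, source, target, removed=c)
    )


if __name__ == "__main__":
    # Hypothetical service: one load balancer in front of two app servers.
    service = {
        "client": ["lb"],
        "lb": ["app1", "app2"],
        "app1": ["db"],
        "app2": ["db"],
        "db": [],
    }
    # The app servers are redundant, but the lone load balancer is not.
    print(single_points_of_failure(service, "client", "db"))  # ['lb']
```

Removing each component in turn is a brute-force approach, chosen here for clarity rather than efficiency; it makes explicit the question the designer is asking of every element in the system.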
ABOUT THE AUTHOR
Dr. Vinton G. Cerf is Vice President and Chief Internet Evangelist for Google. Widely known as one of the “Fathers of the Internet,” Cerf is the co-designer of the TCP/IP protocols and the architecture of the Internet. For his pioneering work in this field as well as for his inspired leadership, Cerf received the A.M. Turing Award, the highest honor in computer science, in 2004.
At Google, Cerf is responsible for identifying new enabling technologies to support the development of advanced, Internet-based products and services. Cerf is also Chairman of the Internet Ecosystem Innovation Committee (IEIC), an independent committee that promotes Internet diversity by forming global Internet nexus points, and is one of the global industry leaders honored in the inaugural InterGlobix Magazine Titans List.
Cerf is former Senior Vice President of Technology Strategy for MCI Communications Corporation, where he was responsible for guiding corporate strategy development from the technical perspective. Previously, Cerf served as MCI’s Senior Vice President of Architecture and Technology, where he led a team of architects and engineers to design advanced networking frameworks, including Internet-based solutions for delivering a combination of data, information, voice, and video services for business and consumer use. He also previously served as Chairman of the Internet Corporation for Assigned Names and Numbers (ICANN), the group that oversees the Internet’s growth and expansion, and Founding President of the Internet Society.