Prioritize Education; Optimize Uptime

How identifying knowledge and skills gaps can reduce human-related risk in data center outages

Widely accepted around the world as the fourth utility, the digital infrastructure industry is under relentless pressure to perform. Round-the-clock operation is the expected norm: four out of five businesses require a guaranteed uptime of 99.99 percent from their cloud service vendors.

Data centers are the scaffolding of our everyday lives, upholding global services not only in business activities, but also, crucially, in the ecosystem of the modern connected world.

According to IDC, organizations average 69 hours per year of unplanned downtime across systems due to human error. Coupled with the fact that this unwelcome interruption costs an estimated 5,400 USD to 7,900 USD per minute, an outage is an expense that few businesses can easily endure. Downtime costs are just too serious to leave to chance.
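To put these figures in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustrative back-of-the-envelope calculation only: the function names are hypothetical, and multiplying the 69-hour average by the per-minute cost range assumes the two figures can be combined directly, which the sources cited here do not state.

```python
# Back-of-the-envelope downtime arithmetic using the figures quoted above.
# Assumption: a non-leap year, and a flat per-minute cost for every minute down.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per year permitted by an uptime guarantee."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def annual_downtime_cost(hours_down, cost_per_minute):
    """Cost in USD of a given number of downtime hours at a flat per-minute rate."""
    return hours_down * 60 * cost_per_minute


budget = allowed_downtime_minutes(99.99)   # what a 99.99 percent guarantee allows
low = annual_downtime_cost(69, 5_400)      # 69 hours at the low cost estimate
high = annual_downtime_cost(69, 7_900)     # 69 hours at the high cost estimate

print(f"99.99% uptime allows about {budget:.2f} minutes of downtime per year")
print(f"69 hours of downtime costs roughly {low:,} to {high:,} USD")
```

Under these assumptions, a 99.99 percent guarantee leaves room for under an hour of downtime per year, while 69 hours of human-error downtime would run to tens of millions of dollars.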

After any outage or expense of this kind, the necessary next step is to investigate what, or indeed who, was responsible, in order to identify the point(s) of failure and implement measures to ensure it doesn’t recur. Uptime Institute’s Annual Outage Analysis (2022) recorded that nearly 40 percent of organizations say they have suffered a major outage caused by human error over the past three years. In reality, this figure may be even greater. Based on 25 years of collated industry data, and considering that human error often plays a role in outages attributed to other causes, Uptime Institute estimates that human error accounts for around two-thirds of all data center outages.

The Outage Analysis goes on to report that, when asked about the cost of their most recent outage, “a quarter of respondents say the outage had cost more than 1 million USD in both direct and indirect costs,” a significant increase from 2021’s figure of 15 percent of respondents. Furthermore, based on publicly reported data center downtime, Uptime Institute predicts there will be at least 20 serious, high-profile IT downtime incidents worldwide each year.

Sarah Parks, Director of Marketing & Communications at CNet Training

So, with human error clearly named as the most likely culprit, what exactly is it, and can anything be done to mitigate it? There is a rich scientific literature offering varying definitions of human error, but the definitions agree that the term describes an action that results in an undesired negative consequence.

Of course, some of the behavior behind these figures is purely accidental and may be beyond the reach of prevention. To reduce the chance of errors, however, we need to differentiate between human error and human-related risk. The crux of that distinction is whether anything can be done to stop it. While human error is inevitable to a certain degree, managing and mitigating human-related risk is not only possible but imperative in order to deliver the uninterrupted services that the world demands.
