
Why systems go down, and what keeps them up

When a system goes down at the wrong moment, the cost is rarely just the downtime. It is the order that did not process, the clinician who could not pull a record, the customer who watched your status page instead of using your product, and the board asking why no one saw it coming.

The reassuring part, if there is one, is that most outages are not caused by exotic failures. They are caused by ordinary, preventable things. The Uptime Institute, which has tracked IT and data center outages for 25 years, reports that nearly 40 percent of organizations had a major outage caused by human error in the past three years, and that about 85 percent of those incidents traced back to staff not following procedures or to the procedures themselves being inadequate. Their data also points to power and configuration problems, not mysterious software, as leading causes, with IT and network complexity now driving a growing share. The pattern holds year after year: outages are mostly an operations and process problem, not a technology mystery.

That is good news, because process problems are fixable. Here is what actually moves uptime.

Engineering practices that raise uptime

Define what “up” means and measure it. Set objectives for the few things that matter, whether the application is responding and whether transactions are completing, and track them. You cannot improve reliability you do not measure, and a vague “is it up?” hides the slow degradations that become outages.
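
As a rough illustration of measuring "up", the sketch below turns request counts into an availability figure and shows how much of the allowed failure budget has been spent. The names slo_report and allowed_downtime_hours, the 99.9 percent default target, and the request-based definition of availability are assumptions made for this example, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float       # fraction of requests that succeeded
    target: float             # the objective, e.g. 0.999
    error_budget_used: float  # 1.0 means the allowed failures are all spent


def slo_report(total_requests: int, failed_requests: int,
               target: float = 0.999) -> SloReport:
    """Compare measured availability against a target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    availability = 1 - failed_requests / total_requests
    # The error budget is the number of failures the target tolerates.
    allowed_failures = (1 - target) * total_requests
    return SloReport(availability, target, failed_requests / allowed_failures)


def allowed_downtime_hours(target: float,
                           period_hours: float = 365 * 24) -> float:
    """Hours of downtime a time-based target permits over the period."""
    return (1 - target) * period_hours
```

Under these assumptions, a 99 percent target over a year allows roughly 87.6 hours of downtime, and a 99.9 percent target allows roughly 8.8 hours. Watching the budget figure climb is one way to catch a slow degradation before it becomes an outage.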

Treat change as the main risk. Most incidents follow a change: a deploy, a configuration edit, a patch. Controlled change management, automated deploys that can roll back, and testing before production remove the largest source of self-inflicted outages.
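
One way to picture a deploy that can roll back is the sketch below. It is a minimal outline, not a real deployment tool: apply_change, is_healthy, and roll_back are hypothetical callables you would supply, and the number of health checks and the wait between them are arbitrary assumptions.

```python
import time
from typing import Callable


def deploy_with_rollback(apply_change: Callable[[], None],
                         is_healthy: Callable[[], bool],
                         roll_back: Callable[[], None],
                         checks: int = 3,
                         wait_seconds: float = 0.0) -> bool:
    """Apply a change, then undo it if any follow-up health check fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    apply_change()
    for _ in range(checks):
        time.sleep(wait_seconds)  # give the system time to settle
        if not is_healthy():
            roll_back()           # restore the previous known-good state
            return False
    return True
```

The point of the design is that the rollback decision is made by the automation, based on a health signal, rather than by someone noticing a problem hours later.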

Build so that any one part can fail. Redundancy and automatic failover mean one server, one zone, or one dependency going down does not take the whole system with it. Contain the blast radius so a single failure stays small.
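
Automatic failover can be sketched in a few lines. The example below assumes each redundant replica is represented by a hypothetical zero-argument function that either returns a result or raises an exception. Real failover usually lives in load balancers, DNS, or database clusters rather than in application code, so treat this as an illustration of the idea only.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    if not replicas:
        raise ValueError("at least one replica is required")

    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # one replica failing must not stop the call
            errors.append(exc)

    # Every replica failed: surface that clearly, keeping the last cause.
    raise RuntimeError(f"all {len(errors)} replicas failed") from errors[-1]
```

A caller only sees an error when every replica is down, which is the sense in which a single failure stays small.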

See problems before customers do. Monitor and alert on the signals that predict failure (error rates, latency, and capacity) so the team responds to a warning instead of a customer complaint.
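
As a hedged example of alerting on a leading signal, the sketch below tracks the error rate over the most recent requests and flags when it crosses a threshold. The class name ErrorRateAlert, the window of 100 requests, and the 5 percent threshold are assumptions chosen for illustration. A production setup would normally rely on a monitoring system rather than hand-rolled code.

```python
from collections import deque
from typing import Deque


class ErrorRateAlert:
    """Flag when the error rate over recent requests exceeds a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.05,
                 min_samples: int = 10) -> None:
        self.outcomes: Deque[bool] = deque(maxlen=window)  # True means success
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok: bool) -> bool:
        """Record one request outcome and return True if an alert should fire."""
        self.outcomes.append(ok)
        if len(self.outcomes) < self.min_samples:
            return False  # too little data to judge; avoid noisy alerts
        error_rate = self.outcomes.count(False) / len(self.outcomes)
        return error_rate > self.threshold
```

Because the window slides, the alert fires while errors are rising and clears on its own once recent requests succeed again.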

Shorten recovery, not just prevent failure. Failures will happen. What separates resilient operations is how fast they recover. Practiced runbooks, clear ownership during an incident, and blameless reviews afterward turn each outage into a permanent fix instead of a recurring one.

These are the practices behind the public engineering standards companies like Google have published in their site reliability work, and they apply as well to a fifty-person company as to a hyperscaler. None require a bigger budget. They require discipline.

Business continuity and disaster recovery

Reliability is the day to day. Business continuity and disaster recovery are the answer to the bad day: a data center loss, a ransomware event, a region going dark. The questions a board should be able to get answered:

How long can each critical system be down before it hurts, and how much data can we afford to lose? Those two numbers, the recovery time objective and the recovery point objective, drive every other decision.
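
The two numbers can be checked against what is actually in place. The sketch below assumes a simple model in which the worst-case data loss equals the interval between backups and the recovery time is whatever the last restore drill measured. RecoveryTargets and meets_targets are hypothetical names for this example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict


@dataclass
class RecoveryTargets:
    rto: timedelta  # recovery time objective: maximum tolerable downtime
    rpo: timedelta  # recovery point objective: maximum tolerable data loss


def meets_targets(targets: RecoveryTargets,
                  backup_interval: timedelta,
                  measured_restore_time: timedelta) -> Dict[str, bool]:
    """Compare current practice with the stated recovery targets.

    Assumes a failure just before the next backup loses one full interval
    of data, and that the restore time comes from an actual drill.
    """
    return {
        "rpo_met": backup_interval <= targets.rpo,
        "rto_met": measured_restore_time <= targets.rto,
    }
```

For example, a system with a one-hour recovery point objective that is backed up nightly fails the check no matter how fast the restore is, which is the kind of gap these two numbers are meant to expose.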

If the primary environment is gone, what is the plan to bring the business back, and when did we last test it? An untested plan is a document, not a capability. The organizations that recover quickly are the ones that have actually run the drill.

Are backups isolated and restorable? Backups reachable from the same network a ransomware attacker is on are not backups.

The federal contingency-planning guidance from NIST lays out the same discipline regulated industries already expect: know your critical systems, set your recovery targets, document the plan, and test it.

What I do

I have taken uptime from 75 to 98.5 percent at a fifty-five-location operator, held it above 99 percent at a multi-billion-dollar insurer, and built disaster recovery and business continuity plans across distributed sites. None of it was a single heroic fix. It was the unglamorous discipline above applied consistently: measure it, control change, build in redundancy, watch it, and practice the recovery.

If your systems are going down often enough that you are worried about the next one, that is a solvable problem, and usually a faster one to fix than people expect.


Written by Jon McAnnis, Principal Advisor at Groundwork Technology Advisors.