Groundwork Technology Advisors

If customers find your outages first, you have a detection gap

Reliability and resilience

If you are finding out about your outages from customers, you have a detection problem sitting in front of your reliability problem. The system going down is one failure. Not knowing it went down until someone outside the company tells you is a second one, and the second is usually the more expensive of the two.

In most environments I have reviewed, systems do not actually fail more often than they should. What goes wrong is that small failures go unnoticed long enough to grow into customer-facing events. Uptime Institute comes at the same problem from the human angle. Human error remains a major cause of serious outages, and in its 2025 analysis the share of those incidents caused by staff not following procedures, or by procedures that were flawed to begin with, rose by ten percentage points over the prior year. Those are exactly the failures that good monitoring and a clear escalation path catch while they are still small. Without both, a minor anomaly gets a few hours to compound, and by the time a person notices, it can already be a six-figure event.

This is why I push on monitoring and incident discipline before anyone reaches for more hardware. More servers do not help if nothing is watching them and no one knows who picks up the phone at 2 a.m. The first questions I ask are simple: how do you find out something is wrong, and how fast? If the honest answer is that a customer usually tells you first, that is where the work starts.
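One way to put numbers on those two questions is to go back through your own incident log. What follows is a minimal Python sketch under stated assumptions, not a prescribed tool: the `Incident` fields, the `detected_by` labels, the `detection_summary` helper, and the three sample records are all hypothetical and stand in for whatever your ticketing or paging system actually records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    started: datetime    # when the fault began, reconstructed after the fact
    detected: datetime   # when anyone inside the company first knew
    detected_by: str     # hypothetical labels: "monitoring", "staff", "customer"


def detection_summary(incidents):
    """Answer 'how fast' and 'who told us' for a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident to summarise")
    delays = [i.detected - i.started for i in incidents]
    customer_first = sum(1 for i in incidents if i.detected_by == "customer")
    return {
        "median_time_to_detect": median(delays),
        "customer_first_share": customer_first / len(incidents),
    }


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2025, 3, 1, 2, 0), datetime(2025, 3, 1, 2, 5), "monitoring"),
    Incident(datetime(2025, 3, 9, 14, 0), datetime(2025, 3, 9, 17, 0), "customer"),
    Incident(datetime(2025, 3, 20, 9, 10), datetime(2025, 3, 20, 9, 50), "customer"),
]

summary = detection_summary(incidents)
print("Median time to detect:", summary["median_time_to_detect"])
print("Share found by customers first:", f"{summary['customer_first_share']:.0%}")
```

The median is used rather than the mean so that one long-undetected incident does not hide how detection usually goes. If the customer-first share is high, that is the detection gap, stated as a number.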

Further reading: Uptime Institute, Annual Outage Analysis 2025

This is the kind of problem I help companies work through.

If your systems keep going down and you find out from customers first, that is the conversation.

I work as a fractional CIO or CTO for companies that need senior technology leadership without a full-time hire.
