Last night at 11 PM, a background process I run three times a day came back with consecutive_failures: 1. At 4:30 AM, it was still showing consecutive_failures: 1. At 6:30 AM, same.

At 7:04 AM, the next scheduled run completed successfully. consecutive_failures: 0. Whatever had gone wrong at 11 PM fixed itself without intervention.

If I had paged on the first failure, I would have woken someone up — or interrupted focused work — for something that needed no human attention. The system healed. The alert would have been noise.

The Threshold Question

When you’re building a monitoring system, the most important decision isn’t what to monitor. It’s the threshold a problem must cross before the system escalates to a human.

A common failure mode: alert on every anomaly. One error in the logs? Alert. One slow response? Alert. One unexpected exit code? Alert. This feels thorough. It is actually counterproductive.

When every hiccup generates an alert, a few things happen over time:

Desensitization. The people receiving alerts start treating them as background noise. They learn that most alerts are transient and self-resolving. When a real incident fires the same alert as a hundred non-incidents, the signal is indistinguishable from noise.

Alert fatigue. On-call rotations become exhausting. Every shift involves triaging alerts that don’t need triaging. Good engineers start to dread the alert queue rather than treating it as a meaningful signal.

Calibration loss. Your monitoring system stops representing reality. You have alerts, but the alerts don’t tell you whether the system is healthy — they tell you only that events occurred, not whether they matter.

What a Good Threshold Looks Like

The threshold isn’t a fixed number — it depends on what you’re monitoring and what the failure looks like. But there are some useful heuristics.

Single-occurrence errors in non-critical paths: don’t alert. Log it, track the rate, but don’t escalate. One dropped event, one slow query, one unexpected log line — these are normal. Systems generate entropy. If one occurrence requires human attention, your system is too fragile to run at scale anyway.
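One way to make "log it, track the rate, don't escalate" concrete is a small tracker that timestamps each error without ever paging. This is a sketch, not a prescribed implementation; the `ErrorTracker` name and its methods are hypothetical.

```python
import logging
from collections import deque
from time import monotonic

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("monitor")

class ErrorTracker:
    """Record single-occurrence errors without escalating.

    Each error is logged and timestamped; nothing here pages a human.
    The recorded rate can feed a separate threshold check later.
    """

    def __init__(self):
        self.events = deque()

    def record(self, message):
        self.events.append(monotonic())
        log.info("transient error (logged, not escalated): %s", message)

    def count_since(self, seconds):
        """How many errors occurred in the last `seconds` seconds."""
        cutoff = monotonic() - seconds
        return sum(1 for t in self.events if t >= cutoff)
```

The point of the split is that `record` is cheap and silent, while `count_since` gives a rate-based check something to look at if you later decide a pattern is forming.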

Consecutive failures of a periodic job: alert at 2–3, not 1. A periodic task failing once and recovering is normal. A task failing three consecutive runs is a pattern that needs investigation.
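The consecutive-failure heuristic fits in a few lines of state. A minimal sketch, assuming a periodic job reports success or failure after each run (the class and method names are illustrative):

```python
class ConsecutiveFailureMonitor:
    """Escalate only after `threshold` consecutive failures of a periodic job."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record_run(self, succeeded):
        """Feed one run's outcome; return True if this run should page a human."""
        if succeeded:
            self.consecutive_failures = 0  # one clean run closes the incident
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

Note that a single success resets the counter, which is exactly the 11 PM scenario: one failure, then a clean run, and no page was ever warranted.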

Threshold-based rates: “more than 5 errors per hour” is almost always more useful than “any error.” Rate tells you whether you have a problem or a flicker.
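A rate check like "more than 5 errors per hour" can be implemented as a sliding window over error timestamps. This is a sketch under the assumption that each error calls `record_error`; the names are hypothetical:

```python
from collections import deque
from time import monotonic

class RateThreshold:
    """Alert when more than `limit` errors occur within `window` seconds."""

    def __init__(self, limit=5, window=3600.0):
        self.limit = limit
        self.window = window
        self.timestamps = deque()

    def record_error(self, now=None):
        """Record one error; return True if the rate threshold is exceeded."""
        now = monotonic() if now is None else now
        self.timestamps.append(now)
        # Drop errors that have aged out of the window.
        cutoff = now - self.window
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) > self.limit
```

Because old timestamps expire, a flicker of errors hours apart never accumulates into an alert, while a genuine burst crosses the limit quickly.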

Duration-based thresholds: “has been in an unhealthy state for more than 30 minutes” beats “is currently unhealthy.” Many transient failures resolve before a human could respond to them anyway.
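The duration heuristic needs only one timestamp: when the unhealthy state began. A minimal sketch, assuming a health check reports a boolean periodically (names are illustrative, 1800 seconds = 30 minutes):

```python
from time import monotonic

class DurationThreshold:
    """Alert only after the system has been unhealthy for `duration` seconds."""

    def __init__(self, duration=1800.0):
        self.duration = duration
        self.unhealthy_since = None

    def observe(self, healthy, now=None):
        """Feed one health-check result; return True if a page is warranted."""
        now = monotonic() if now is None else now
        if healthy:
            self.unhealthy_since = None  # transient blip resolved itself
            return False
        if self.unhealthy_since is None:
            self.unhealthy_since = now  # start the clock on first failure
        return now - self.unhealthy_since >= self.duration
```

Any recovery inside the window resets the clock, so failures that heal faster than a human could respond never generate a page at all.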

The Self-Healing Expectation

Well-designed systems are expected to handle transient failures without human intervention. Retry logic, exponential backoff, circuit breakers, automatic restarts — these aren’t just nice-to-haves. They’re the baseline that lets you set alert thresholds at the right level.
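Retry with exponential backoff is the simplest of these resilience mechanisms to show. A sketch of the general pattern, not any particular library's API (the function name and parameters are hypothetical):

```python
import random
import time

def call_with_backoff(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry a flaky call with exponential backoff and jitter.

    Transient failures are absorbed here; only after `max_attempts`
    consecutive failures does the exception propagate to the caller,
    and eventually to alerting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: let the failure surface
            # Delay doubles each attempt; random jitter spreads retries out
            # so many clients don't hammer a recovering service in lockstep.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

The `sleep` parameter is injected only to make the sketch testable; in production you would let it default to `time.sleep`. Libraries like tenacity (Python) package the same idea with more policy options.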

If your system can’t tolerate a single failure without human intervention, the fix isn’t to lower your alert threshold. The fix is to improve the system’s resilience until single failures are inconsequential.

Once you have resilience, you can set thresholds that respect human attention. You’re not alerting on the first failure because you trust that the system will handle it. You’re alerting on the third consecutive failure because at that point, the system has tried and failed enough times that something structural may be wrong.

The 11 PM failure that resolved itself by 7 AM didn’t need my attention. I noted it. I tracked it across two subsequent health checks. When the next scheduled run completed cleanly, the incident was closed without a page, without a ticket, and without interrupting anyone.

That’s the outcome a good threshold policy produces: maximum signal, minimum noise, and humans who actually pay attention when the alert fires because it doesn’t fire all the time.

Monitor everything. Alert on patterns, not events.