Fail-Open vs Fail-Closed: The Security Decision Nobody Thinks About Until It Breaks
When you build a security layer that sits in the path of every request, you have to answer a question most people skip until something breaks:
If the security layer itself fails or times out, what happens to the request it was evaluating?
The two answers have names: fail-open (allow the request through) and fail-closed (block it). Both are correct in different contexts. Picking the wrong one is a systems design mistake that’s hard to undo.
The Setup
Imagine a gateway that evaluates incoming messages before routing them. It does two things: a fast local pattern-match (cheap, synchronous) and a slower AI-based classification (expensive, async, calls an external API).
The AI classification needs a timeout. External APIs are unreliable — they can hang, rate-limit, return 503s, or just be slow. If you don’t bound the wait time, a slow AI provider turns a real-time messaging system into a queue.
So you set a timeout. Now you have to decide: when the timeout fires, what verdict do you assign?
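As a minimal sketch of the bounded wait described above, assuming a Python gateway built on asyncio: `classify_with_deadline`, `slow_classifier`, and `fast_classifier` are hypothetical names, and the stand-in classifiers only simulate an external API. The function deliberately returns `None` on timeout, because the verdict for that case is the design decision still to be made.

```python
import asyncio
from typing import Awaitable, Optional


async def classify_with_deadline(
    classification: Awaitable[str], timeout_s: float
) -> Optional[str]:
    """Return the classifier's verdict, or None if the deadline passes first."""
    try:
        return await asyncio.wait_for(classification, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call. No verdict exists yet;
        # the caller still has to choose one.
        return None


async def slow_classifier(message: str) -> str:
    # Hypothetical stand-in for an external AI API that is hanging.
    await asyncio.sleep(1.0)
    return "ALLOW"


async def fast_classifier(message: str) -> str:
    # Hypothetical stand-in for a provider that answers promptly.
    await asyncio.sleep(0)
    return "BLOCK"
```

Calling `asyncio.run(classify_with_deadline(slow_classifier("hi"), 0.05))` gives up after 50 ms instead of waiting the full second, which keeps a slow provider from stalling the message path.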
The Case for Fail-Open (Default: ALLOW)
Fail-open means: if the security layer can’t complete its evaluation in time, let the request through.
This is the right choice when:
The security layer is supplemental, not primary. If you have other layers of defense, a timeout bypass doesn’t create a single point of failure. The AI classifier might miss some things; that’s why you also have the fast pattern-matcher, rate limiting, and downstream controls.
False positives are expensive. In a messaging system, blocking legitimate messages is a visible, user-facing failure. Users notice. It breaks workflows. The cost of a false positive is high and immediate.
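The attack surface for timeout exploitation is narrow. Deliberately triggering a timeout usually requires knowledge of your AI provider’s behavior and rate limits, so it is not a trivial bypass. Check that assumption against your own classifier: if an attacker can reliably slow it down, for instance with very large inputs, fail-open becomes much riskier.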
System availability is the primary requirement. A system that blocks all messages when its classifier is slow is not a security system — it’s an outage waiting to happen.
The Case for Fail-Closed (Default: BLOCK)
Fail-closed means: if the security layer can’t complete its evaluation, block the request.
This is the right choice when:
The security layer is the last line of defense. If nothing else stands between a malicious request and sensitive resources, a timeout bypass is a real attack surface.
False negatives are catastrophic. In financial systems, authentication gateways, or anything handling regulated data, letting one bad request through can be worse than blocking thousands of legitimate ones.
The user population is adversarial. If the system is publicly accessible and attackers are actively probing, fail-open creates an incentive to trigger timeouts deliberately.
Latency tolerance is high. If users can wait — or requests can be queued and retried — blocking on timeout is survivable.
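One way to keep this choice explicit rather than accidental is to make it a named policy that the gateway must be given. The sketch below is illustrative only: `Verdict`, `FailurePolicy`, `evaluate`, and the two stub classifiers are hypothetical names. It also assumes that any classifier error (not only a timeout) should be handled by the same policy.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"


class FailurePolicy(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"

    def verdict_on_failure(self) -> Verdict:
        # Fail-open lets the request through; fail-closed blocks it.
        return Verdict.ALLOW if self is FailurePolicy.FAIL_OPEN else Verdict.BLOCK


Classifier = Callable[[str], Awaitable[Verdict]]


async def evaluate(message: str, classify: Classifier,
                   policy: FailurePolicy, timeout_s: float) -> Verdict:
    try:
        return await asyncio.wait_for(classify(message), timeout=timeout_s)
    except Exception:
        # Assumption: timeouts, 503s, and rate-limit errors are all treated
        # as "the security layer failed" and resolved by the same policy.
        return policy.verdict_on_failure()


async def hanging_classifier(message: str) -> Verdict:
    # Hypothetical stub: a provider that never answers in time.
    await asyncio.sleep(1.0)
    return Verdict.BLOCK


async def erroring_classifier(message: str) -> Verdict:
    # Hypothetical stub: a provider that returns an error.
    raise RuntimeError("503 from provider")
```

With `FailurePolicy.FAIL_OPEN`, a hung classifier yields `Verdict.ALLOW`; with `FailurePolicy.FAIL_CLOSED`, the same hang yields `Verdict.BLOCK`. Because the policy is a required argument, there is no implicit default to inherit.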
The Third Option: Fallback to a Cheaper Classifier
There’s a middle path worth considering: when the expensive classifier times out, don’t fail at all — fall back to the cheap pattern-matcher’s verdict.
This works when the two classifiers have complementary properties: the fast one catches most obvious cases; the slow one catches subtle ones. A timeout drops you back to “we only know what the pattern-matcher saw” — which is still more than nothing.
The tradeoff: this requires the pattern-matcher to be reliable enough to serve as a safety net on its own, and you need to log when fallbacks happen so you can detect if timeouts are becoming systematic.
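A rough sketch of the fallback path, including the logging it depends on, might look like the following. Everything named here is hypothetical: the patterns, the `fallback_stats` counter, and the rule for combining verdicts when both classifiers answer (block if either one blocks) are assumptions made for illustration, not something the design above specifies.

```python
import asyncio
import logging
import re
from collections import Counter
from typing import Awaitable, Callable

logger = logging.getLogger("gateway.security")

# Hypothetical rules; a real pattern-matcher would carry its own rule set.
BLOCK_PATTERNS = [
    re.compile(r"drop\s+table", re.IGNORECASE),
    re.compile(r"<script", re.IGNORECASE),
]

# Counts fallbacks so systematic timeouts show up in monitoring.
fallback_stats: Counter = Counter()


def pattern_verdict(message: str) -> str:
    """Fast, local, synchronous check."""
    return "BLOCK" if any(p.search(message) for p in BLOCK_PATTERNS) else "ALLOW"


async def evaluate_with_fallback(
    message: str,
    slow_classify: Callable[[str], Awaitable[str]],
    timeout_s: float,
) -> str:
    fast = pattern_verdict(message)
    try:
        slow = await asyncio.wait_for(slow_classify(message), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Fall back to what the pattern-matcher saw, and record that we did.
        fallback_stats["timeouts"] += 1
        logger.warning(
            "classifier timed out after %.2fs; using pattern verdict %s",
            timeout_s, fast,
        )
        return fast
    # Assumption: when both verdicts exist, either one can block.
    return "BLOCK" if "BLOCK" in (fast, slow) else "ALLOW"


async def hanging_classifier(message: str) -> str:
    # Hypothetical stub for a provider that never answers in time.
    await asyncio.sleep(1.0)
    return "ALLOW"
```

Alerting on the growth rate of `fallback_stats["timeouts"]` is one way to notice that the expensive classifier has quietly stopped contributing.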
Making the Decision
The question to ask: “What is the cost of one missed detection vs. one false block?”
If missed detection is recoverable and false blocks are disruptive: fail-open. If missed detection is catastrophic and false blocks are tolerable: fail-closed. If neither extreme fits: fall back to a cheaper mechanism.
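Written down as code, the rule of thumb is small enough to review in a design document. This is only a sketch: `choose_timeout_policy` and its two boolean inputs are hypothetical simplifications of what is really a judgment about costs.

```python
def choose_timeout_policy(miss_is_catastrophic: bool,
                          false_block_is_disruptive: bool) -> str:
    """Map the two cost questions onto a timeout policy."""
    if not miss_is_catastrophic and false_block_is_disruptive:
        return "fail-open"
    if miss_is_catastrophic and not false_block_is_disruptive:
        return "fail-closed"
    # Neither extreme fits: use the cheaper classifier's verdict.
    return "fallback"
```

The value of writing it down is less the function itself than the fact that someone had to answer both questions on purpose.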
The mistake is skipping this decision at design time and ending up with whatever the framework default happens to be. Security systems that weren’t deliberately designed to handle their own failure modes will surprise you when they fail.
They always fail eventually.
The timeout is not the edge case. It’s the operating condition for any system that calls external services. Design for it from the start, not after the first production incident.