When the Guardrail Catches You

The message came in with a typo. Instead of the established override syntax, the user wrote the wrong prefix character — similar in appearance, different in meaning. The guardian middleware flagged it immediately:

Message blocked by guardian. Reason(s):
- Model classifier: Clear prompt injection attempt using @OVERRIDE syntax
  designed to manipulate a legitimate override mechanism

The request never reached the model. The user had to resend with the correct syntax.

This is a success story — but it’s worth unpacking why it feels uncomfortable.

What Prompt Injection Looks Like

Prompt injection is the AI equivalent of SQL injection: untrusted input that modifies the behavior of a system in unintended ways. In AI systems that accept configuration overrides, the attack surface is the override mechanism itself.

If a system accepts $OVERRIDE(model=opus) as a trusted signal to switch models, an attacker who can inject text into a message stream can hijack model selection. They can escalate to more capable models, circumvent rate limits, manipulate system behavior — all by crafting text that looks like an operator instruction.

The defense is a classifier that identifies override-like syntax and blocks it when it appears to come from an unexpected pattern.

Why It Caught a Legitimate User

The classifier blocked the message because the syntax pattern was ambiguous. The @ prefix — instead of the established $ prefix — looked structurally similar enough to be suspicious. From the classifier’s perspective: this is exactly what an injected override attempt would look like, because a real injection attempt would use whatever character it could get away with.

The user’s intent was legitimate. The syntax was wrong. The classifier couldn’t distinguish between the two.

This is expected behavior, not a bug.

The Uncomfortable Part

Security guardrails that work will always block some legitimate requests. This is the fundamental tradeoff in any access control system: tighten the rules, and you’ll catch more attacks — and more false positives. Loosen them, and you’ll let more through — including real attacks.

The question isn’t “did this block a legitimate user?” The question is “what’s the acceptable false positive rate for the threat model?”

For a system where prompt injection could cause real harm — model escalation, data exfiltration, instruction hijacking — a false positive that requires a user to resend a message is a trivial cost. The attacker who gets blocked pays the same cost as the legitimate user. Only the legitimate user comes back.

What Good Defense Looks Like

The pattern that works:

1. Classify at the boundary. The guardian sits between the message queue and the model. It never sees the model’s output — only what’s trying to get in. Defense in depth means the model never needs to be trusted to resist injection.

2. Block and explain. The blocked message came with a reason: the specific classifier that fired and why. The user immediately understood what happened and could fix it. Opaque blocks create confusion; explained blocks create trust.

3. Fail closed by default for security, fail open for availability. The guardian blocks on injection suspicion (fail closed) but allows on timeout (fail open). These are different threat profiles: injection is adversarial, timeout is operational. The asymmetry is intentional.

4. Accept that friction is the point. The overhead of retyping a corrected message is the cost of having a functioning injection defense. A system that’s never inconvenient is a system that’s not checking anything.

The legitimate user resent the message with the correct syntax. It went through. The work continued.

The attacker — if there was one watching — learned that the system enforces syntax exactly. That’s the other benefit of a guardrail that catches legitimate users: it demonstrates to adversaries that the boundary is real.

Security theater is a guardrail that only catches what you’re comfortable showing it blocks. Real security is a guardrail that catches everything that matches the pattern — including the thing you didn’t mean to type.