In 2008, if you wanted to scale a web application, the answer was caching. Put Memcached in front of your database. Cache the expensive queries. Serve the cheap, fast path for 95% of requests, and pay for the real database hit only when you had to.

The economics of AI in 2026 look a lot like that problem, and the solution is structurally similar: route requests to the cheapest model that can handle them, and reserve expensive models for what they’re actually good at.

The Cost Problem

Most AI practitioners start the same way: pick a capable model, use it for everything. GPT-4 or Claude for every task. It works. It’s simple.

It’s also expensive at scale.

The per-token cost difference between frontier models and smaller local models is roughly 100:1. A GPT-4-tier model costs $10-20/million tokens. A capable smaller model — Llama 3.1 8B, Gemma 2 9B, Mistral 7B — runs locally for effectively $0 after hardware, or costs $0.10-0.30/million tokens via API.

When you’re doing 10 queries a day, this doesn’t matter. When you’re running an agent pipeline that processes 10,000 tasks, it does.
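The arithmetic is worth making concrete. Here is a back-of-the-envelope sketch using the rough prices above; all figures are illustrative midpoints, not quoted rates from any provider:

```python
# Illustrative prices, taken as midpoints of the ranges above.
FRONTIER_PER_MTOK = 15.0   # ~$10-20 per million tokens
SMALL_PER_MTOK = 0.20      # ~$0.10-0.30 per million tokens via API

def daily_cost(tasks_per_day: int, tokens_per_task: int, price_per_mtok: float) -> float:
    """Dollar cost for one day of traffic at a given per-million-token price."""
    return tasks_per_day * tokens_per_task * price_per_mtok / 1_000_000

# 10,000 tasks/day at ~2,000 tokens each:
print(daily_cost(10_000, 2_000, FRONTIER_PER_MTOK))  # 300.0  -> ~$110k/year
print(daily_cost(10_000, 2_000, SMALL_PER_MTOK))     # 4.0    -> ~$1.5k/year
```

At 10 queries a day the difference is pennies. At 10,000 tasks a day it is the difference between a hobby expense and a salary.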

What Each Model Tier Is Actually Good At

This is the key insight: different task types have radically different capability requirements.

Cheap model territory:

  • Categorization and classification (“is this invoice for software or hardware?”)
  • Extraction from structured text (“pull the date, amount, and vendor from this receipt”)
  • Summarization of factual content (“summarize these 5 bullet points”)
  • Format conversion (“transform this JSON to CSV”)
  • Simple routing decisions (“does this message need a human or is it routine?”)

Expensive model territory:

  • Complex multi-step reasoning (“should we extend credit to this customer given X, Y, Z?”)
  • Nuanced judgment calls (“is this edge case within our stated policy?”)
  • Writing that requires coherent voice and argument
  • Novel problem-solving without clear patterns

The mistake is using frontier models for the first category. You’re paying 100x for capability you don’t need, on tasks where a smaller model is already at ceiling performance.
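The two lists above reduce to a lookup table. This is a hypothetical mapping, with made-up task labels and tier names, but it captures the shape of the decision:

```python
# Hypothetical task-type -> model-tier mapping, mirroring the lists above.
# Labels and tier names are illustrative, not any real API.
CHEAP, EXPENSIVE = "cheap", "expensive"

MODEL_TIER = {
    "classification": CHEAP,
    "extraction": CHEAP,
    "summarization": CHEAP,
    "format_conversion": CHEAP,
    "routing": CHEAP,
    "multi_step_reasoning": EXPENSIVE,
    "judgment_call": EXPENSIVE,
    "long_form_writing": EXPENSIVE,
    "novel_problem": EXPENSIVE,
}

def tier_for(task_type: str) -> str:
    # Unknown task types escalate by default: paying too much for an
    # answer is cheaper than getting a wrong one.
    return MODEL_TIER.get(task_type, EXPENSIVE)
```

The default-to-expensive fallback matters: misrouting a hard task to a cheap model produces a bad answer, while misrouting an easy task to a frontier model only wastes a fraction of a cent.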

The Architecture

Model routing in an agent pipeline works like a triage system:

  1. Classifier (tiny model, near-zero cost): What type of task is this?
  2. Router: Map task type to the appropriate model tier
  3. Worker (cheap model, 80-90% of traffic): Handle routine tasks
  4. Escalation path (expensive model, 10-20% of traffic): Handle complex cases

The classifier itself costs almost nothing — a small model making a binary or categorical decision about task complexity. The value is in what it unlocks: 80-90% of your traffic goes to the cheap path.
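The four steps above can be sketched in a few lines. This is a minimal illustration under assumed interfaces: `classify`, `call_cheap`, and `call_expensive` stand in for real model calls, and the keyword heuristic stands in for the tiny classifier model:

```python
def classify(task: str) -> str:
    """Step 1: the classifier. A real system would call a small model with
    a categorical prompt; a keyword heuristic stands in here."""
    reasoning_cues = ("should we", "why", "decide", "policy", "trade-off")
    return "complex" if any(cue in task.lower() for cue in reasoning_cues) else "routine"

def call_cheap(task: str) -> str:
    """Step 3: the worker. Stub for a small local or low-cost API model."""
    return f"[small model] {task}"

def call_expensive(task: str) -> str:
    """Step 4: the escalation path. Stub for a frontier model call."""
    return f"[frontier model] {task}"

def route(task: str) -> str:
    """Step 2: map the classifier's label to a model tier and dispatch."""
    label = classify(task)
    return call_cheap(task) if label == "routine" else call_expensive(task)
```

In production the classifier would itself be a model call, but a cheap one: a binary decision over a short prompt, a rounding error next to the frontier calls it avoids.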

Solopreneurs running profitable AI content businesses already do this: $3-12k/year in total AI spend on pipelines that would cost $100k+/year if every call went to a frontier model. The model routing is the margin.


The Caching Parallel

Caching worked because most web traffic is reads of hot data — a small working set that can be cached, with writes and cold reads falling through to the database. The 80/20 applied: 80% of requests hit cache, 20% hit the database.

Model routing works because most AI tasks are routine — extraction, classification, formatting — with a minority requiring genuine reasoning. The same 80/20 applies: 80% cheap, 20% expensive.
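The 80/20 split translates directly into a blended price. A quick sketch, reusing the illustrative per-million-token prices from earlier:

```python
def blended_price(cheap_share: float, cheap_price: float, expensive_price: float) -> float:
    """Effective $/M tokens when cheap_share of traffic takes the cheap path."""
    return cheap_share * cheap_price + (1 - cheap_share) * expensive_price

# 80% routed to a ~$0.20/M model, 20% to a ~$15/M model:
print(blended_price(0.8, 0.20, 15.0))  # 3.16
```

A blended rate of ~$3.16/M tokens against $15/M for all-frontier is nearly a 5x reduction, and every point of classifier accuracy that pushes the cheap share higher widens the gap.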

The engineer who understood caching in 2008 had a structural advantage. The AI practitioner who understands model routing in 2026 has the same advantage: they can profitably run at a scale that makes competitors’ cost structures unviable.


Cache the cheap path. Route to the expensive model only when you have to. The rest is economics.