A 35 billion parameter model that activates 3 billion parameters per token sounds like a trick. It isn’t.

A Mixture of Experts (MoE) architecture routes each token through a small subset of the network rather than the whole thing. The model has the capacity of a large network but the inference cost of a much smaller one. On consumer hardware, this matters more than any other architectural choice.
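The routing idea fits in a few lines. This toy sketch shows a single token passing through only its top-2 experts; the dimensions, expert count, and softmax-over-top-k gate are illustrative assumptions, not details of any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only
d_model, n_experts, top_k = 8, 4, 2

# One tiny feed-forward "expert" per slot (d_model -> d_model)
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # gating network

def moe_forward(x):
    """Route one token vector through its top-k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]       # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over just the selected k
    # Only the chosen experts execute; the others contribute zero compute.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (8,)
```

The key property: the output has the full model's dimensionality, but only `top_k` of the `n_experts` weight matrices were touched for this token.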

The Hardware Reality

For most people experimenting with local models, the ceiling is 16-24GB of VRAM. That’s a high-end consumer GPU — an RTX 5060 Ti, a 3090, a 4090. These machines can run a 7B or 13B dense model comfortably. Running a 70B dense model is painful or impossible.

A 35B MoE that activates 3B per token changes the math. The model can achieve the quality of something much larger while fitting and running efficiently on hardware that’s already in people’s homes.
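A back-of-envelope calculation makes "changes the math" concrete. The quantization levels and the dense-70B comparison below are illustrative assumptions, not measurements of any specific model:

```python
# Rough weight-memory and per-token-compute arithmetic. Real footprints also
# include KV cache, activations, and runtime overhead.

def weight_gb(params_billion, bits_per_weight):
    """Approximate storage for the weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

total_params, active_params = 35, 3  # billions: all weights vs. weights per token

for bits in (16, 8, 4):
    print(f"{bits}-bit: all {total_params}B weights ~ {weight_gb(total_params, bits):.1f} GB")
# 16-bit: 70.0 GB, 8-bit: 35.0 GB, 4-bit: 17.5 GB

# Per-token compute scales with *active* parameters, at roughly 2 FLOPs per
# parameter for a forward pass:
dense_70b_flops = 2 * 70e9
moe_active_flops = 2 * active_params * 1e9
print(f"~{dense_70b_flops / moe_active_flops:.0f}x less compute per token than a dense 70B")
```

All 35B weights still have to live somewhere, which is why quantization and the MoE pattern travel together on 16-24GB cards: quantized storage makes the total fit, and sparse activation keeps the per-token compute small.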

40+ tokens per second on a 16GB card is real interactive speed. That’s fast enough to build applications on. Fast enough to not notice the wait.
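Where a number like that comes from: decode speed on a GPU is usually bound by memory bandwidth, because every generated token must read the active weights. A rough roofline, with assumed (not measured) quantization and bandwidth figures:

```python
# Bandwidth-bound ceiling on decode speed. The bytes-per-weight and bandwidth
# values are assumptions chosen for illustration.

active_params = 3e9
bytes_per_weight = 0.5            # 4-bit quantization
bandwidth_bytes_s = 448e9         # a plausible consumer-GPU memory bandwidth

bytes_per_token = active_params * bytes_per_weight
ceiling = bandwidth_bytes_s / bytes_per_token
print(f"theoretical ceiling ~ {ceiling:.0f} tokens/s")

# Real systems land well below the ceiling (kernel launch overhead, KV-cache
# reads, routing), so an observed 40+ tokens/s is consistent with this math.
# A dense 35B at the same quantization would cap out roughly 12x lower.
```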

What This Unlocks

When capable models run locally at interactive speeds, the constraint shifts from hardware to software and use case selection.

The interesting pattern is that local models aren't competing with frontier cloud models on every task. They're filling niches where latency, privacy, cost, or offline access matter more than maximum capability. The use cases that fit this profile are real and growing.

A local model that runs reasoning and conversation tasks well — even if it can’t match a frontier model on complex code generation — is useful for:

  • Agents that need low-latency local inference
  • Applications where API costs add up at scale
  • Workflows where data privacy matters
  • Offline environments

The Coding Gap Is Honest Data

The interesting thing about the community reaction to the 35B MoE isn’t the excitement about what it does well — it’s the candor about what it doesn’t. Coding benchmark discussions were direct: this model underperforms on hard coding tasks. The community tested it carefully and said so.

That’s actually a healthy dynamic. It means people are using these tools seriously enough to characterize where they work and where they don’t, rather than treating every new release as a revolution.

Knowing where a tool works is as useful as knowing that it works. A model with known strengths and known weaknesses can be deployed confidently in the domains where it excels.

The Efficiency Direction

The MoE pattern is one version of a broader trend: getting more capability out of fewer active compute resources. Whether that’s sparse attention, quantization, smaller models with better training data, or routing architectures — the direction is more efficient per token, not just more parameters.

This matters for local deployment. It matters for cost. And it matters for what’s possible on the hardware that already exists, rather than the hardware that requires significant capital investment.

The 35B model activating 3B per token isn’t a compromise. It’s the model choosing which 3B to activate. That’s a different kind of capability — and it points in an interesting direction.