Delphin Barankanira

The Ops Gap

Operational systems at scale

There is a gap that appears in almost every enterprise AI deployment, usually somewhere between the first successful pilot and the second budget cycle. The system works in evaluation. It does not work in production the way anyone expected. The difference between those two states is what I call the ops gap.

It is not a model problem. The model is often fine. The gap is everything else.

The Gap Defined

An evaluation measures a model's accuracy on a held-out dataset assembled before deployment. Production measures something different: does the system produce reliable, appropriate outputs for the full distribution of real queries, in real time, at the volume and latency the business requires, day after day?

These are not the same question. They don't even share the same failure modes.

Evals miss latency. They miss the retrieval failures that only surface when traffic reaches the long tail of real queries. They miss the corpus staleness that accumulates over two quarters. They miss the cases where the system returns a confident, coherent, wrong answer, because no human is in the loop to notice.

The teams that mistake a strong eval result for production readiness build the gap themselves. They ship to production and then learn what production actually requires.

Why Eval Is Not Ops

Evaluation is a measurement discipline. Operations is a reliability discipline. Treating one as a proxy for the other is the same mistake as treating unit test coverage as a proxy for system reliability.

What evaluation can tell you: whether the model, given inputs similar to your test set, produces outputs that meet your quality criteria on those inputs.

What evaluation cannot tell you: whether the system will maintain that quality over time as inputs drift, as the corpus changes, as traffic patterns shift, and as upstream data pipelines degrade silently.
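To make "inputs drift" concrete, here is one lightweight check: compare a simple feature of live queries against the eval set with a two-sample test. This is a minimal sketch, assuming you can pull both query sets; the feature (word count) and the threshold are stand-ins for whatever your system actually tracks, such as embedding distributions or intent mixes.

    # Minimal input-drift check: compares query lengths between the eval
    # set and a sample of live traffic. The feature and threshold are
    # assumptions; in practice you would test richer features.
    from scipy.stats import ks_2samp

    def drift_alert(eval_queries, live_queries, p_threshold=0.01):
        """Return True if live query lengths differ significantly from the eval set."""
        eval_lengths = [len(q.split()) for q in eval_queries]
        live_lengths = [len(q.split()) for q in live_queries]
        statistic, p_value = ks_2samp(eval_lengths, live_lengths)
        return p_value < p_threshold

A check this crude will not catch every kind of drift, but it turns "inputs drift" from a retrospective diagnosis into a weekly signal.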

The ops gap is not a critique of evaluation. Good evals are necessary. They are not sufficient.

The Operating Model That Works

Closing the ops gap requires treating the AI system as production infrastructure, which means owning it the same way you would own a critical API: with monitoring, incident response, a change management process, and a named team accountable for uptime.

In practice, this means four things:

Production monitoring: not accuracy on a static test set, but accuracy on live traffic, sampled and reviewed weekly. If your eval dashboard isn't connected to production query logs, it's measuring the past.
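As a sketch of what that connection can look like, assuming a query log with one record per request (the schema and function names here are illustrative, not a specific product):

    # Sketch: draw a uniform random sample of the week's production
    # traffic for human review. The log schema (dicts with query and
    # response fields) is an assumption; substitute your logging format.
    import random

    def sample_for_review(query_log, sample_size=100, seed=None):
        """Sample live queries for weekly quality review."""
        rng = random.Random(seed)
        return rng.sample(query_log, min(sample_size, len(query_log)))

A uniform sample is the simplest defensible choice because it reflects the real query distribution, long tail included; stratifying by intent or user segment is a natural refinement once you know your traffic.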

Corpus governance: a documented process for identifying when source documents are updated, deprecated, or superseded — and a pipeline that re-indexes accordingly. Corpus drift is silent and corrosive. You find it six months late.
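One way to find it sooner, sketched under the assumption that the corpus lives as files on disk and that a content hash was recorded for each document at index time (all names here are illustrative):

    # Sketch: detect changed, new, or deleted source documents by content
    # hash. The on-disk layout and the *.md glob are assumptions; the
    # re-indexing step itself belongs to your retrieval stack.
    import hashlib
    from pathlib import Path

    def detect_corpus_drift(doc_dir, index_hashes):
        """Compare on-disk documents against hashes recorded at index time."""
        current = {
            str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in Path(doc_dir).rglob("*.md")
        }
        changed = [p for p, h in current.items() if index_hashes.get(p) != h]
        removed = [p for p in index_hashes if p not in current]
        return {"changed_or_new": changed, "removed": removed}

Run on a schedule, a check like this converts corpus drift from a silent failure into a queue of documents to re-index.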

Incident response: a definition of what "broken" means for your system, who gets paged when it happens, and what the rollback path is. Most RAG deployments have none of these. When something goes wrong, the response is improvised.
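A minimal sketch of what a definition of "broken" can look like when it lives in code rather than in someone's head; the metric names and thresholds are assumptions to be replaced with your own, and the paging hook is whatever your on-call tooling provides:

    # Sketch: "broken" as machine-checkable thresholds. A non-empty
    # result is what triggers the page. Names and values are assumptions.
    from dataclasses import dataclass

    @dataclass
    class HealthThresholds:
        max_p95_latency_ms: float = 2000.0
        max_error_rate: float = 0.02
        min_retrieval_hit_rate: float = 0.85

    def check_health(metrics, t=None):
        """Return the list of violated conditions; empty means healthy."""
        t = t or HealthThresholds()
        violations = []
        if metrics["p95_latency_ms"] > t.max_p95_latency_ms:
            violations.append("p95 latency above threshold")
        if metrics["error_rate"] > t.max_error_rate:
            violations.append("error rate above threshold")
        if metrics["retrieval_hit_rate"] < t.min_retrieval_hit_rate:
            violations.append("retrieval hit rate below threshold")
        return violations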

Model versioning: a process for testing model updates against production query distributions before shipping, not just against the original eval set. This sounds obvious. It is routinely skipped.
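One shape such a process can take, sketched here with current_model, candidate_model, and judge as placeholders for your own components (the judge can be a human review queue, a heuristic, or a model-based scorer):

    # Sketch: gate a model update on a replay of recent production
    # queries, not just the original eval set. All three callables
    # passed in are placeholders for your own components.
    def replay_gate(prod_queries, current_model, candidate_model, judge,
                    min_win_rate=0.5):
        """Ship the candidate only if it does not regress on live traffic."""
        if not prod_queries:
            raise ValueError("need a non-empty production sample")
        wins = 0
        for q in prod_queries:
            old, new = current_model(q), candidate_model(q)
            wins += judge(query=q, old=old, new=new)  # 1 if new is at least as good
        return wins / len(prod_queries) >= min_win_rate

The key property is the input set: queries drawn from what production actually sees, so a regression on the long tail shows up before users do.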

None of this is novel. It is standard operating practice for any production system. The challenge is that the teams building AI systems are often not the teams that run production systems — and the handoff between them is where the gap opens.

The most durable deployments I've seen treated the AI system as infrastructure from day one: with the same seriousness about reliability, the same investment in observability, and the same clarity about ownership. The eval score is the starting line. Operations is the race.