Delphin Barankanira

Why Enterprise RAG Projects Fail in Year Two

Data architecture under stress

Most enterprise AI projects look good at twelve months. The model works. The demo impresses. The initial cohort of users is engaged. Leadership extends the budget. Someone schedules a case study.

By month eighteen, the cracks appear. By month twenty-four, the project is either quietly descoped or handed to a team that wasn't involved in building it.

This is not a model problem. It is an operational one.

The Year-One Demo

The first year of a RAG deployment is unusually forgiving. The corpus is curated and fresh. The evaluation set was built by the team that also built the retrieval pipeline — a conflict of interest that rarely gets named. The users who show up first are enthusiasts, not the median employee. The failure modes that will dominate in production are invisible because volume is low.

In this environment, almost any reasonable architecture looks adequate. The team hits the initial benchmark. The project gets approved.

The Year-Two Cliff

What changes in year two is unglamorous: the corpus goes stale, the user base widens, and the edge cases multiply faster than the team can write eval cases for them.

Stale retrieval is the most common failure. Documents are updated, deprecated, or superseded. The pipeline doesn't know this because no one built a corpus governance process — that seemed like an operations problem, not an AI problem, and it got deferred. Now the model is confidently retrieving outdated policy documents and no one has a dashboard that would surface it.
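None of this requires exotic tooling. A minimal staleness check, assuming the index records when each document was embedded and the source system exposes a last-modified timestamp (the field names below are illustrative, not any particular product's API), can be a scheduled job like this:

```python
def find_stale_documents(index_records, source_metadata):
    """Flag indexed documents whose source changed after they were embedded.

    index_records: iterable of dicts with 'doc_id' and 'indexed_at' (datetime),
        captured at ingest time.
    source_metadata: dict mapping doc_id -> {'last_modified': datetime} from the
        source system (CMS, wiki, policy repository, etc.).
    Both structures are assumptions for illustration.
    """
    stale, orphaned = [], []
    for record in index_records:
        doc_id = record["doc_id"]
        source = source_metadata.get(doc_id)
        if source is None:
            # Source document was deleted or deprecated; the index doesn't know.
            orphaned.append(doc_id)
        elif source["last_modified"] > record["indexed_at"]:
            # Source changed since the last embedding run; these chunks are stale.
            stale.append(doc_id)
    return stale, orphaned


# Run this on a schedule and put the counts on a dashboard. A rising
# stale/orphaned count is the early warning that year one never needed.
```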

The evaluation suite stops keeping pace. The team wrote a hundred eval cases at launch. Production now sees ten thousand distinct query patterns a week. The eval coverage is effectively zero, but the dashboard still shows 87% accuracy because no one has updated the denominator.
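The denominator problem is easy to make concrete. A rough coverage estimate, assuming you log production queries and have some way of bucketing them into patterns (the normalize function below is a stand-in for whatever bucketing you use), looks like this:

```python
from collections import Counter

def eval_coverage(production_queries, eval_queries, normalize):
    """Estimate what fraction of production query volume the eval set touches.

    normalize() is a stand-in for your pattern bucketing: lowercasing plus
    stopword stripping, an intent classifier, or embedding clusters.
    """
    production_patterns = Counter(normalize(q) for q in production_queries)
    covered_patterns = {normalize(q) for q in eval_queries}

    covered_volume = sum(
        count for pattern, count in production_patterns.items()
        if pattern in covered_patterns
    )
    total_volume = sum(production_patterns.values())
    return covered_volume / total_volume if total_volume else 0.0


# An 87% pass rate on 100 eval cases means little if this number is 0.01:
# the accuracy figure describes the eval set, not production traffic.
```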

Ownership fragments. The ML team owns the model. The data team owns the pipeline. The product team owns the user experience. Nobody owns the outcome. When something goes wrong — and something always goes wrong — the incident bounces between three teams before anyone claims it.

Four Failure Modes

Across multiple deployments, the failures cluster into four categories:

Corpus drift: no process for identifying and re-indexing changed source documents. Fix: automated staleness detection and a quarterly corpus audit owned by someone with a job title.

Eval atrophy: evaluation sets built once and never updated against production query logs. Fix: a feedback loop that surfaces low-confidence retrievals and routes a sample to human review weekly (a sketch follows this list).

Latency tolerance: the p95 latency that was acceptable at launch is unacceptable at scale. Fix: load test before you go wide, not after.

Shadow escalation: users find workarounds for failure cases and stop reporting them. Fix: make failure reporting easier than the workaround.
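
To make the eval-atrophy fix concrete: the weekly review loop can start as a single scheduled job. This is a minimal sketch, assuming the pipeline logs each query with its top retrieval score (the field names and the threshold are placeholders to calibrate against your own score distribution, not recommendations):

```python
import random

def sample_for_review(retrieval_logs, score_threshold=0.45, sample_size=50, seed=None):
    """Pull a weekly sample of low-confidence retrievals for human review.

    retrieval_logs: iterable of dicts with 'query', 'top_score', and 'answer'
        (illustrative field names; adapt to whatever your pipeline logs).
    score_threshold: below this similarity score the result is treated as
        low confidence. The 0.45 default is a placeholder.
    """
    low_confidence = [r for r in retrieval_logs if r["top_score"] < score_threshold]
    rng = random.Random(seed)
    return rng.sample(low_confidence, min(sample_size, len(low_confidence)))


# Reviewed samples become next quarter's eval cases, which is how the
# evaluation set keeps pace with the query patterns production actually sees.
```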

What Survives

The RAG projects that hold up at thirty-six months share one characteristic: they were run as products, not projects. There was a named owner, a user feedback mechanism, a corpus governance process, and a quarterly review that measured production accuracy against the initial benchmark — not a held-out test set.

None of that is technically interesting. All of it is load-bearing.

If you're planning an enterprise RAG deployment, the technical architecture matters less than you think once it clears a certain quality threshold. What matters past that threshold is whether the organization is built to maintain what it built.