2026-05-26 | infra | data

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

Zafar Hussain, Kristoffer Nielbo

Key claim

A post-retrieval cascade reduces latency and improves quality relative to always-on HyDE augmentation.

This study finds that most real user queries do not need LLM-based query augmentation, contrary to what synthetic-query evaluations assume. With a post-retrieval cascade, the authors serve 72.2% of queries in a 1000-query evaluation without LLM augmentation. Relative to Always-HyDE, end-to-end latency falls by 31.8% and the composite quality score rises by 0.140.

Novelty

7.0/10

The paper moves the query-augmentation decision from before retrieval to after it, addressing the failure of pre-retrieval routing on real user queries.

Reliability

8.0/10

The claims are well-supported by empirical evaluation across a substantial dataset of real user queries.

Deep reliability assessment

The methodology strongly supports the claim that, for this Danish National Encyclopedia production RAG system under a deferral policy, a post-retrieval cascade can reduce LLM augmentation while improving quality and latency versus Always-HyDE. It overclaims if read as proving that pre-retrieval routing generally fails or that synthetic RAG benchmarks are broadly misleading across all domains, corpora, and user-query distributions.

Reproducibility

No open-source code or public dataset is mentioned in the provided abstract, introduction, results, discussion, or conclusion excerpts. The evaluation uses production traffic from the Danish National Encyclopedia, which appears not to be released.

Discussion questions

1. Is the core assumption that zero retrieved documents is the right escalation trigger robust, or does it depend heavily on this system's deferral policy and retriever calibration?
2. For builders, should the default RAG architecture move from always-on query augmentation to cheap-first post-retrieval cascades, and what monitoring is needed to avoid silent quality regressions?
3. What evidence would falsify the Coverage Illusion: for example, a production workload where synthetic and real queries require augmentation at similar rates, or where query-only pre-retrieval routing captures most of the oracle gap?

Key figure

The key architecture is a cheapest-first post-retrieval cascade that runs Hybrid retrieval first, checks whether any sources were returned, and escalates to QE-CE and then HyDE only when earlier steps return no documents.
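The escalation logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The stage names (Hybrid, QE-CE, HyDE) come from the text. The `run_cascade` function, the `CascadeResult` type, and the three stub retrievers are hypothetical stand-ins introduced here, and the "no documents returned" check is modelled as an empty list.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A retriever maps a query string to a list of retrieved documents.
Retriever = Callable[[str], List[str]]


@dataclass
class CascadeResult:
    documents: List[str]  # documents from the stage that answered
    stage: str            # name of the last stage that was run
    stages_tried: int     # how many stages were executed


def run_cascade(query: str, stages: Sequence[Tuple[str, Retriever]]) -> CascadeResult:
    """Run stages cheapest-first; escalate only when a stage returns no documents."""
    if not stages:
        raise ValueError("at least one stage is required")
    for tried, (name, retrieve) in enumerate(stages, start=1):
        documents = retrieve(query)
        if documents:
            # Stop at the first stage that returns anything.
            return CascadeResult(documents, name, tried)
    # Every stage came back empty; report the last stage with no documents.
    return CascadeResult([], stages[-1][0], len(stages))


# Hypothetical stub retrievers, for illustration only.
def hybrid(query: str) -> List[str]:
    return ["doc-hybrid"] if "viking" in query else []


def qe_ce(query: str) -> List[str]:
    return ["doc-qe-ce"] if "norse" in query else []


def hyde(query: str) -> List[str]:
    return ["doc-hyde"] if "longship" in query else []


STAGES = [("Hybrid", hybrid), ("QE-CE", qe_ce), ("HyDE", hyde)]

easy = run_cascade("viking settlements", STAGES)   # answered by the first stage
hard = run_cascade("longship building", STAGES)    # escalates through all stages
empty = run_cascade("unrelated topic", STAGES)     # no stage returns documents
```

In this sketch the expensive stages run only for queries the cheaper ones cannot serve, which is where the reduction in LLM augmentation would come from. The trigger depends entirely on the retriever returning an empty result, so a retriever that always returns some low-quality documents would never escalate.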

Benchmark results

Dataset: Danish National Encyclopedia real user queries, 1000-query evaluation. Changes are relative to Always-HyDE.

Composite Overall score: 4.084 (+0.140)

End-to-end latency: 65.6 s (-31.8%)

Queries using LLM augmentation: 27.8% (-72.2 percentage points)
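The Always-HyDE baseline values are not listed, but they can be back-computed from the reported figures. The sketch below assumes the latency change is a relative percentage of the baseline, and that the score and augmentation changes are absolute differences. The helper names are hypothetical, and the results are implied values, not numbers stated in the source.

```python
def baseline_from_relative(value: float, change_pct: float) -> float:
    """Baseline b such that value = b * (1 + change_pct / 100)."""
    return value / (1.0 + change_pct / 100.0)


def baseline_from_absolute(value: float, delta: float) -> float:
    """Baseline b such that value = b + delta."""
    return value - delta


# Reported cascade figures and their changes versus Always-HyDE.
latency_baseline = baseline_from_relative(65.6, -31.8)       # seconds
score_baseline = baseline_from_absolute(4.084, 0.140)        # composite score
augmentation_baseline = baseline_from_absolute(27.8, -72.2)  # percent of queries

print(f"Implied Always-HyDE latency: {latency_baseline:.1f} s")
print(f"Implied Always-HyDE composite score: {score_baseline:.3f}")
print(f"Implied Always-HyDE augmentation rate: {augmentation_baseline:.1f}%")
```

Under these assumptions the implied augmentation rate is 100%, which is consistent with a baseline that applies HyDE to every query.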