2026-06-04 · reasoning · data · code

Self-Augmenting Retrieval for Diffusion Language Models

Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger

Key claim

SARDI improves dynamic retrieval for diffusion language models on multi-hop QA.

The paper presents SARDI, a novel framework that enhances retrieval-augmented generation by utilizing lookahead tokens from discrete diffusion language models. This approach significantly improves performance on multi-hop QA tasks while maintaining high throughput.

In plain English

The authors developed a new method called SARDI that helps language models generate better answers by looking ahead at potential words they might use. Unlike previous methods, SARDI can quickly find relevant information without needing extra training. This means it can work faster and more effectively on complex questions. Builders should care because it shows a new way to improve AI responses using existing models without extensive retraining.

Novelty

8.0/10

The introduction of a dynamic retrieval framework that leverages lookahead tokens represents a significant advancement in the application of diffusion models.

Reliability

8.0/10

The claims are well-supported by empirical results across multiple benchmarks, demonstrating solid performance improvements over existing methods.

Deep reliability assessment

The experiments support that intermediate denoising states in a reasoning-capable discrete diffusion LM can improve dynamic retrieval for multi-hop QA, with clear EM and latency gains over the paper's training-free AR and diffusion baselines. Broader claims about general applicability and throughput are less fully supported because results are limited to QA-style benchmarks, mostly BM25 retrieval, one hardware setup, and depend on DLMs producing useful reasoning traces, which the paper notes off-the-shelf instruction-tuned DLMs often do not.

Reproducibility

Yes for code: the abstract mentions https://github.com/pauljngr/SARDI. Datasets are public or benchmark-derived: 2WikiMultiHopQA, HotpotQA, MuSiQue, CofCA, and SynthWorlds-RM; retrieval setup details include BM25, K=7 passages, and latency on a single NVIDIA B200 GPU.
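
As a rough illustration of the retrieval setup mentioned above, here is a minimal BM25 top-K sketch in plain Python. It is not taken from the SARDI repository: the class name TinyBM25, the tokenizer, and the k1 = 1.5 and b = 0.75 values are assumptions (common BM25 defaults), and the IDF uses the non-negative Lucene-style variant. Only K=7 comes from the summary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric word tokens.
    return re.findall(r"\w+", text.lower())


class TinyBM25:
    """Hypothetical minimal Okapi BM25 index (illustration only)."""

    def __init__(self, passages, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(p) for p in passages]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        norm = 1 - self.b + self.b * dl / self.avgdl
        total = 0.0
        for t in tokenize(query):
            f = tf.get(t, 0)
            if f:
                total += self.idf[t] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def top_k(self, query, k=7):
        # Return indices of the k highest-scoring passages (K=7 in the summary).
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


corpus = [
    "The Mona Lisa hangs in the Louvre",
    "The Louvre is a museum located in Paris",
    "Berlin is the capital of Germany",
]
index = TinyBM25(corpus)
ranking = index.top_k("Where is the Louvre located", k=7)
```

With a corpus smaller than K, top_k simply returns every passage in ranked order.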

Discussion questions

1. Does SARDI's core assumption hold beyond multi-hop QA: do low-confidence diffusion tokens reliably expose useful future entities, or are they mostly benchmark-specific artifacts of entity-centric reasoning?
2. For builders, when would the latency and infrastructure complexity of iterative retrieval during denoising be worth it compared with simpler AR RAG, reranking, or query-rewriting pipelines?
3. What evidence would falsify the result: would SARDI lose its advantage if evaluated with stronger dense retrievers, longer open-ended generation, noisy enterprise corpora, or DLMs that do not emit stable reasoning traces?

Key figure

Figure 1 shows a diffusion LM partially denoising an answer, surfacing the bridge entity "Louvre" before finalizing the output, then using that tentative token to retrieve second-hop evidence that the Louvre is located in Paris.
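
The idea in Figure 1 can be sketched with a toy example: tentative tokens from a partially denoised answer are appended to the question before retrieving. This is only a sketch under stated assumptions, not the authors' implementation. The MASK marker, the stopword list, the function names, and the word-overlap scorer (a stand-in for a real retriever such as BM25) are all hypothetical, and the passages are invented for illustration.

```python
import re

MASK = None  # hypothetical marker for a position that is still masked
STOPWORDS = {"the", "is", "a", "in", "which", "that", "has"}  # assumed toy list


def content_words(text):
    # Lowercased word set with the toy stopwords removed.
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def lookahead_query(question, partial_tokens):
    """Append tentative (already unmasked) tokens that add new words to the question."""
    known = content_words(question)
    extra = []
    for tok in partial_tokens:
        if tok is MASK:
            continue
        new = content_words(tok) - known
        if new:
            known |= new
            extra.append(tok)
    return " ".join([question] + extra)


def retrieve(query, passages, k=2):
    # Stand-in retriever: rank passages by content-word overlap with the query.
    q = content_words(query)
    return sorted(passages, key=lambda p: len(q & content_words(p)), reverse=True)[:k]


question = "Which city has the museum that houses the Mona Lisa?"
partial = [MASK, "Louvre", MASK, MASK]  # tentative bridge entity surfaced mid-denoising
passages = [
    "The Mona Lisa hangs in the Louvre.",
    "The British Museum is located in London.",
    "The Louvre is a museum located in Paris.",
]

plain_hits = retrieve(question, passages)
lookahead_hits = retrieve(lookahead_query(question, partial), passages)
```

In this toy setup the question alone ties the two museum passages, so the second-hop passage about Paris misses the top 2, while the query augmented with the tentative token "Louvre" ranks it second.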

Benchmark results

2WikiMultihopQA | Exact Match x100: 59.1 | vs AR W/ RET@1: +0.3

HotpotQA | Exact Match x100: 48.7 | vs AR W/ RET@1: +1.3

CofCA | Exact Match x100: 45.3 | vs ReAct: +2.4

MuSiQue | Exact Match x100: 20.6 | vs AR W/ RET@1: +0.8

SynthWorlds-RM | Exact Match x100: 21.7 | vs AR W/ RET@1: +1.3
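
For readers unfamiliar with the metric, "Exact Match x100" is the percentage of predictions that equal the gold answer after normalization. The sketch below assumes SQuAD-style normalization (lowercasing, removing punctuation and articles, collapsing whitespace); the summary does not state which normalization the paper uses, and the function names and example answers are hypothetical.

```python
import re
import string


def normalize(answer):
    # Assumed SQuAD-style normalization: lowercase, drop punctuation and articles.
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def exact_match_x100(predictions, golds):
    """Share of predictions that match their gold answer exactly, scaled to 0-100."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


# Invented toy answers, not results from the paper.
preds = ["Paris.", "the Louvre", "Berlin"]
golds = ["paris", "Louvre", "Munich"]
score = exact_match_x100(preds, golds)
```

Two of the three toy predictions match after normalization, so the score here is two thirds of 100.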

GitHub: 1 repo

pauljngr/SARDI (Official)