How Has LLM Memory and Retrieval Evolved?
What’s come after RAG, and what’s after that
Author: Brendan Beh, AI.SEA | Source: AI.SEA learning commons
Overview
This document traces the evolution of how large language models store and retrieve knowledge, from crude early approaches through to agentic memory systems where the model actively manages its own memory.
Core Concept: Two Types of Memory
Before tracing the eras, it helps to understand the two fundamental ways an LLM can "know" something:
Parametric memory (intrinsic) — knowledge baked into the model's weights during training. Fast, but static and expensive to update.
Non-parametric memory (extrinsic) — knowledge retrieved at runtime from external sources. Dynamic, updatable, and auditable.
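A minimal sketch of the contrast, with a stub `llm()` standing in for any real model call; the knowledge base and question here are invented for illustration:

```python
# A minimal sketch of the two memory types. llm() is a stub, not a real API.

def llm(prompt: str) -> str:
    return f"<model completion for: {prompt[:40]}...>"  # stand-in for a real call

# Parametric: the model can only answer from what training baked into its weights.
parametric_answer = llm("When did the Acme v2 API launch?")

# Non-parametric: fetch current facts at runtime and ground the answer on them.
knowledge_base = {"acme_v2": "Acme v2 API launched 2024-03-01; v1 sunset 2024-09-01."}
context = knowledge_base["acme_v2"]
grounded_answer = llm(f"Context: {context}\n\nQuestion: When did the Acme v2 API launch?")
```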
Era 1 — Brute Force
Early approaches involved prompt stuffing (dumping everything into the context window), fine-tuning the model on proprietary data, and summarisation loops that compressed and re-injected context. All three approaches were suboptimal: expensive, inflexible, and unable to cite sources or update without retraining. This is what motivated RAG.
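To make the summarisation-loop pattern concrete, here is a sketch assuming a stub `llm()` and an invented character budget; the comment marks the lossiness that made this approach suboptimal:

```python
# A sketch of an Era-1 summarisation loop: once the transcript outgrows the
# context budget, compress it and re-inject the summary. llm() is a stub.

def llm(prompt: str) -> str:
    return f"<summary of {len(prompt)} chars>"  # stand-in for a real model call

CONTEXT_BUDGET = 8000  # illustrative character budget, not a real model limit

def add_turn(history: list[str], turn: str) -> list[str]:
    history.append(turn)
    if sum(len(t) for t in history) > CONTEXT_BUDGET:
        summary = llm("Summarise, keeping key facts:\n" + "\n".join(history))
        history = [summary]  # lossy: details dropped here are gone for good
    return history
```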
Era 2 — Early RAG
Retrieval-Augmented Generation solved several key problems: knowledge could be updated without touching the model, retrieval was selective rather than feeding everything on every call, and grounding on retrieved chunks reduced hallucination. RAG also kept proprietary data out of model training and decoupled retrieval from any specific LLM provider.
However, early RAG had significant failure modes: brittle chunking, recall gaps where relevant content phrased differently from the query never gets retrieved, no temporal awareness, and no memory between sessions. The deeper structural problem: RAG is a lookup table pretending to be a memory system. Chunks are isolated, memory is read-only, and the agent plays no role in reasoning about what to retrieve.
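The whole early-RAG pipeline fits in a short sketch. The bag-of-words `embed()` below is a toy stand-in for a real embedding model, and the corpus is invented; the chunking comment marks the brittleness described above:

```python
# A minimal early-RAG sketch: fixed-size chunking, embed, top-k cosine
# retrieval, stuff the hits into the prompt. embed() is a toy stand-in.

import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy bag-of-words "embedding"

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(doc: str, size: int = 200) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]  # brittle: splits mid-sentence

docs = ["Refunds are processed within 14 days of a return request being approved."]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    return [c for c, v in sorted(index, key=lambda cv: -cosine(q, cv[1]))[:k]]

prompt = "Context:\n" + "\n".join(retrieve("How long do refunds take?")) + "\n\nQ: How long do refunds take?"
```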
Era 3 — Smarter Storage
Two key advances came in storage architecture:
GraphRAG (Microsoft, 2024) addressed the limitation of standard RAG on "global" queries: questions requiring synthesis across an entire corpus rather than lookup of a specific fact. It builds a knowledge graph from extracted entities and relationships, runs community detection, and generates multi-level hierarchical summaries. It performs strongly on global synthesis queries, is comparable to standard RAG on local queries, and is significantly more expensive to index.
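The indexing shape can be sketched as follows, assuming networkx is available. The hand-written triples stand in for LLM-extracted entities and relationships, `summarise()` is a stub, and greedy modularity is used only as a stand-in for whatever community-detection algorithm a real pipeline runs:

```python
# A sketch of the GraphRAG indexing shape: triples -> graph -> communities
# -> per-community summaries. Not Microsoft's implementation.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

triples = [  # stand-ins for LLM-extracted (entity, relation, entity) triples
    ("Acme", "acquired", "Globex"),
    ("Globex", "develops", "WidgetOS"),
    ("Acme", "partners_with", "Initech"),
    ("Initech", "licenses", "WidgetOS"),
]

G = nx.Graph()
for head, rel, tail in triples:
    G.add_edge(head, tail, relation=rel)

def summarise(entities) -> str:
    return f"<community summary covering: {sorted(entities)}>"  # stub for an LLM call

# Community summaries are what answer "global" questions no single chunk contains.
community_summaries = [summarise(c) for c in greedy_modularity_communities(G)]
```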
Smarter vector DB usage evolved beyond storing everything blindly, introducing metadata enrichment (timestamps, importance scores, access counts), namespace separation by memory type (episodic vs semantic vs procedural), and pre-write logic to check whether a memory already exists or is worth storing at all.
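A sketch of that write path, with invented thresholds and an exact-match duplicate check standing in for embedding-similarity comparison:

```python
# Pre-write logic with metadata enrichment: before storing a new memory,
# check whether a near-duplicate already exists and reinforce it instead.

import time

memories: list[dict] = []  # stand-in for one vector-DB namespace, e.g. "episodic"

def is_duplicate(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()  # real systems compare embeddings

def write_memory(text: str, importance: float) -> None:
    for m in memories:
        if is_duplicate(m["text"], text):   # pre-write check: does this already exist?
            m["access_count"] += 1          # reinforce the existing memory instead
            m["importance"] = max(m["importance"], importance)
            return
    if importance < 0.3:                    # illustrative threshold: not worth storing
        return
    memories.append({"text": text, "timestamp": time.time(),  # metadata enrichment
                     "importance": importance, "access_count": 0})
```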
Graph DBs became relevant for agent memory specifically — where relationships between entities matter as much as the entities themselves. Production systems increasingly combine vector and graph stores: vector search to find the semantic entry point, graph traversal to explore connected context. Examples: Zep, Mem0.
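The read path of that combination can be sketched like this; the word-overlap scorer is a stand-in for real vector search, and the toy graph is invented rather than drawn from any of the systems named above:

```python
# Combined vector + graph retrieval: vector search picks the semantic entry
# point, then graph traversal pulls in connected context.

import networkx as nx

G = nx.Graph()
G.add_edge("Acme", "Globex", relation="acquired")
G.add_edge("Globex", "WidgetOS", relation="develops")

def vector_entry_point(query: str) -> str:
    q = set(query.lower().split())  # stand-in scoring for real vector search
    return max(G.nodes, key=lambda n: len(q & {n.lower()}))

entry = vector_entry_point("What does Globex build?")  # -> "Globex"
hood = nx.ego_graph(G, entry, radius=1)                # 1-hop connected context
context = [f"{u} -[{d['relation']}]-> {v}" for u, v, d in hood.edges(data=True)]
```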
Memory consolidation techniques emerged — progressive compression from raw events to episode summaries to core facts, and reflection, where the agent draws inferences from patterns across memories and stores those inferences as new memories.
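A sketch of both moves, with a stub `llm()` standing in for the model calls a real system would make over its memory store; the events and inferred preference are invented:

```python
# Progressive consolidation and reflection: raw events compress into an
# episode summary, and a recurring pattern becomes a stored inference.

def llm(prompt: str) -> str:
    return f"<completion for: {prompt[:50]}...>"  # stub for a real model call

raw_events = [
    "User asked to reschedule the standup to 14:00.",
    "User asked to move the retro to the afternoon.",
    "User declined a 09:00 meeting invite.",
]

# Compression: many raw events -> one episode summary.
episode_summary = llm("Summarise these events:\n" + "\n".join(raw_events))

# Reflection: infer a durable fact from the pattern and store it as a new memory.
core_fact = llm("What stable preference do these events suggest?\n" + "\n".join(raw_events))
memories = [{"type": "episodic", "text": episode_summary},
            {"type": "semantic", "text": core_fact}]
```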
Era 4 — Smarter Retrieval
Smarter storage alone wasn't enough. Retrieval remained a hard bottleneck — if the right information isn't surfaced, the LLM never sees it regardless of model capability.
Key retrieval advances:
Hybrid retrieval (dense + sparse) — combining semantic vector search with BM25 keyword matching. Dense retrieval handles conceptual similarity and paraphrasing; sparse retrieval catches exact terms, named entities, and technical identifiers that dense search misses.
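One common way to merge the two result sets is reciprocal rank fusion (RRF); the sketch below uses hand-written ranked lists as stand-ins for real retriever output, and k=60 is the conventional RRF constant:

```python
# Hybrid retrieval, fusion step: run dense and sparse retrieval separately,
# then merge the two rankings with reciprocal rank fusion.

from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)  # reward agreement across retrievers
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_refund_policy", "doc_returns_faq", "doc_shipping"]  # paraphrase match
sparse_hits = ["doc_api_v2_errata", "doc_refund_policy"]               # exact-term match, e.g. 'HTTP 422'
fused = rrf([dense_hits, sparse_hits])  # doc_refund_policy rises to the top
```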
Query rewriting — improving the query before retrieval runs, including HyDE (generating a hypothetical answer document and embedding that instead of the query), multi-query retrieval (generating multiple query variants and merging results), step-back prompting (retrieving on the broader underlying concept first), and conversation-aware rewriting to resolve pronouns and implicit references.
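HyDE and multi-query can be sketched side by side; `llm()` and `retrieve()` below are stubs standing in for real model and retriever calls, and the query is invented:

```python
# Two query-rewriting moves: HyDE embeds a hypothetical answer instead of
# the question; multi-query merges results across generated variants.

def llm(prompt: str) -> str:
    return f"<completion for: {prompt[:50]}...>"     # stub for a real model call

def retrieve(text: str, k: int = 3) -> list[str]:
    return [f"<doc matching '{text[:30]}...'>"]      # stub for real vector search

query = "Why is my deployment stuck pending?"

# HyDE: a hypothetical answer tends to sit closer in embedding space to real
# answer passages than the bare question does.
hypothetical = llm(f"Write a plausible answer to: {query}")
hyde_hits = retrieve(hypothetical)

# Multi-query: generate variants, retrieve for each, deduplicate the union.
variants = [llm(f"Rephrase ({i}): {query}") for i in range(3)]
multi_hits = list(dict.fromkeys(h for v in variants for h in retrieve(v)))
```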
Multi-hop retrieval — for questions where the answer requires chaining multiple retrieval steps, each depending on the result of the previous. Implemented via iterative retrieval loops, subgraph traversal in graph stores, or FLARE (where the LLM identifies uncertain parts of its own generation and triggers targeted retrieval for those parts).
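The iterative-loop variant can be sketched as follows; `llm()` and `retrieve()` are stubs, and the `NEXT:` convention for requesting another hop is invented for illustration:

```python
# A multi-hop retrieval loop: each retrieval step is conditioned on what the
# previous step surfaced, until the model judges it can answer.

def llm(prompt: str) -> str:
    return f"<answer grounded in: {prompt[:40]}...>"  # stub for a real model call

def retrieve(query: str) -> list[str]:
    return [f"<doc for '{query[:30]}'>"]              # stub for a real retriever

def multi_hop(question: str, max_hops: int = 3) -> str:
    evidence: list[str] = []
    query = question
    for _ in range(max_hops):                         # guard against non-termination
        evidence += retrieve(query)                   # each hop conditions on the last
        step = llm(f"Question: {question}\nEvidence: {evidence}\n"
                   "Answer if possible, else reply NEXT: <follow-up query>")
        if not step.startswith("NEXT:"):
            return step
        query = step.removeprefix("NEXT:").strip()    # chain into the next hop
    return llm(f"Best effort. Question: {question}\nEvidence: {evidence}")
```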
Era 5 — Memory as Reasoning
The shift from memory as infrastructure to memory as cognition. The agent is no longer a passive consumer of retrieved context — it becomes an active participant in its own memory management: deciding what to look up, evaluating what it gets back, issuing follow-up queries, choosing what to store, deciding what to forget, and maintaining a persistent model of the world across sessions.
Capabilities of a fully agentic memory system include self-directed retrieval with reformulation, memory writing with judgment, reflection and inference from patterns, active memory maintenance (consolidation, promotion, pruning, conflict resolution), and persistent identity across sessions — each new session builds on accumulated shared context rather than starting cold.
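The maintenance side of that list can be sketched as a periodic pass over the store, extending the memory records from the write-path sketch above with a type field; the weights and thresholds are invented for illustration:

```python
# Active memory maintenance: score each memory by importance, use, and age,
# prune the weakest, and promote heavily-used episodic memories to facts.

import time

def maintain(memories: list[dict], now: float | None = None) -> list[dict]:
    now = now or time.time()
    kept = []
    for m in memories:
        age_days = (now - m["timestamp"]) / 86400
        score = m["importance"] + 0.1 * m["access_count"] - 0.05 * age_days
        if score < 0.0:
            continue                      # decay: forget what no longer earns its keep
        if m["type"] == "episodic" and m["access_count"] >= 5:
            m["type"] = "semantic"        # promotion: a recurring detail becomes a fact
        kept.append(m)
    return kept
```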
Production-ready today: MemGPT/Letta, LangGraph, Mem0, Zep.
Still research-stage: truly autonomous memory management, multi-agent memory coherence at scale, reliable conflict and decay handling, and robust evaluation frameworks for memory quality.
The Arc in One View
| Era | What Changed |
|---|---|
| 1 | Model knows only what it was trained on |
| 2 | Model can look things up, but doesn't know you |
| 3 | Model can store and retrieve things about you, but retrieval is mechanical |
| 4 | Retrieval gets smarter: hybrid search, query rewriting, re-ranking, multi-hop |
| 5 | Agent reasons about its own memory, builds and maintains a model of you over time |