The Geometry of Embeddings
All about the geometry and linear algebra behind embeddings.
How to Build Better RAG Using Linear Algebra
Author: Brendan Beh, AI.SEA · Shared under AI.SEA's learning commons — please share, not sell.
What Is an Embedding?
An embedding converts words or sentences into vectors — lists of numbers representing their position in a high-dimensional "meaning space". Similar concepts end up close together, even when they share no words.
The process: text is tokenised → each token is encoded into a vector by a neural network → the vector captures both meaning and position.
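As a rough sketch of that pipeline (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model mentioned later in this guide):

```python
# Sketch: text in, vectors out. Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model producing 384-dimensional vectors

sentences = ["The cat sat on the mat.", "A feline rested on the rug."]
vectors = model.encode(sentences)  # numpy array of shape (2, 384)

# The two sentences share no content words, yet their vectors end up close together.
print(vectors.shape)
```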
From Simple Counting to Contextual Understanding
| Method | How It Works | Key Limitation |
|---|---|---|
| One-Hot Encoding (OHE) | One '1' per word in a huge vector | No meaning captured; enormous size |
| TF-IDF | Weights terms by frequency vs. corpus rarity | Fails when synonyms share no keywords |
| Word2Vec | Predicts context words (CBOW / Skip-gram) | Static — one vector per word regardless of context |
| BERT | Bidirectional transformer encoder | Contextual ('bank' gets a different vector in 'river bank' vs 'bank loan'), but computationally heavier and needs pooling to produce sentence vectors |
Measuring Similarity: Cosine vs. Euclidean
- Euclidean distance — the straight-line gap between two points; sensitive to vector magnitude. Best for physical measurements or ratings where scale matters.
- Cosine similarity — measures the angle between vectors, ignoring magnitude. Best for text: robust to differences in document length or feature scale. Use this by default (a quick comparison of the two follows below).
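To make the difference concrete, here is a small numpy sketch with made-up vectors: two vectors pointing the same way have a large Euclidean gap but a perfect cosine score.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

euclidean = np.linalg.norm(a - b)                          # ~3.74: penalises the length difference
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0: only the angle matters

print(euclidean, cosine)
```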
Always L2-normalise (convert vector lengths to 1) before comparing. In high-dimensional space, Euclidean distances clump together — cosine avoids this.
Watch out for anisotropy — when most vectors cluster in a narrow region. Fix: subtract the corpus mean vector, then re-normalise. If still poor, apply PCA whitening.
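A sketch of that recipe in numpy (with scikit-learn for the optional whitening step); the function names here are illustrative, not from any particular library:

```python
import numpy as np
from sklearn.decomposition import PCA

def centre_and_normalise(embeddings: np.ndarray) -> np.ndarray:
    """Subtract the corpus mean, then L2-normalise each row to unit length."""
    centred = embeddings - embeddings.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    return centred / np.clip(norms, 1e-12, None)

def whiten(embeddings: np.ndarray, n_components: int = 128) -> np.ndarray:
    """Optional second step if similarities are still bunched together: PCA whitening
    rescales each principal component to unit variance. n_components must not exceed
    the number of rows or columns of the input."""
    return PCA(n_components=n_components, whiten=True).fit_transform(embeddings)
```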
From Words to Sentences: Pooling
Comparing every word to every word across millions of documents is too slow. Pooling compresses token-level vectors into a single sentence or document vector:
- Mean pooling — average all word vectors. Simple, robust, and the recommended default for sentence-transformer models (e.g. all-MiniLM-L6-v2); a small sketch follows this list.
- Weighted mean pooling — give high-IDF or high-attention words more weight.
- CLS token — use the special [CLS] vector output from BERT-style models. Fast, but mean pooling often outperforms it empirically.
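A minimal numpy sketch of mean and weighted-mean pooling over token vectors (the shapes and function names are illustrative; in practice the model supplies the token vectors and attention mask):

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions where mask == 0."""
    mask = mask[:, None].astype(float)                     # (num_tokens, 1)
    total = (token_vectors * mask).sum(axis=0)
    return total / np.clip(mask.sum(), 1e-9, None)

def weighted_mean_pool(token_vectors: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted average, e.g. with per-token IDF or attention scores as weights."""
    weights = weights[:, None].astype(float)
    return (token_vectors * weights).sum(axis=0) / np.clip(weights.sum(), 1e-9, None)

# Toy example: 4 tokens, 8-dimensional vectors, last token is padding.
tokens = np.random.rand(4, 8)
sentence_vec = mean_pool(tokens, mask=np.array([1, 1, 1, 0]))
```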
Chunking: Splitting Text Without Breaking Meaning
Long documents must be split before embedding. Too short = noisy and uninformative. Too long = multiple ideas averaged together, losing precision.
| Practice | Recommendation | Why |
|---|---|---|
| Split on | Headings / paragraphs / semantic drift | Respect natural topic boundaries |
| Chunk size | 200–400 tokens (~150–300 words) | Enough info; not too diluted |
| Overlap | 10–15% between consecutive chunks | Preserves context at boundaries |
| Semantic check | cosine < 0.75 → start new chunk | Detects topic drift automatically |
| Normalise | Mean-centre + L2 | Keeps geometry clean |
Avoid "Frankenstein" chunks that mix unrelated topics — the resulting embedding points between meanings and is useless for both. Diagnose with PCA / UMAP scatter plots: blurred edges between clusters signal bad chunking.
Troubleshooting Cheat Sheet
| Symptom | Likely Cause | Fix |
|---|---|---|
| Everything seems similar | Anisotropy / no normalisation | Subtract corpus mean, L2-normalise |
| Unrelated chunks retrieved | Frankenstein chunks | Chunk by headings / semantic drift |
| Same doc appears twice | Overlap too large | Reduce overlap to 10% |
| Cosine scores meaningless | No calibration baseline | Build a small gold evaluation set |
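For the last row, a gold evaluation set can be as small as a handful of hand-labelled query-to-chunk pairs; here is a sketch of scoring retrieval against one (the pairs and names below are hypothetical):

```python
import numpy as np

# Hypothetical gold set: each query is paired with the id of the chunk it should retrieve.
gold = [("How do I reset my password?", "chunk_17"),
        ("What is the refund window?", "chunk_04")]

def top1_accuracy(query_vecs, chunk_vecs, chunk_ids, expected_ids):
    """Fraction of queries whose nearest chunk (by cosine) is the expected one."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    best = (q @ c.T).argmax(axis=1)          # index of the most similar chunk for each query
    hits = [chunk_ids[i] == expected for i, expected in zip(best, expected_ids)]
    return sum(hits) / len(hits)

# Usage: embed the gold queries and all chunks with the same model, then call
# top1_accuracy(query_vecs, chunk_vecs, chunk_ids, [g[1] for g in gold]).
```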
Further Reading
- Harsh Vardhan — A Comprehensive Guide to Word Embeddings in NLP
- Microsoft Ignite — Design and Develop a RAG Solution
- LangChain Docs — Build a RAG Agent
- HuggingFace — LLM Embeddings Explained: A Visual and Intuitive Guide