The Geometry of Embeddings
All about the geometry and linear algebra behind embeddings.
How to Build Better RAG Using Linear Algebra
Author: Brendan Beh, AI.SEA · Shared under AI.SEA's learning commons — please share, not sell.
What Is an Embedding?
An embedding converts words or sentences into vectors — lists of numbers representing their position in a high-dimensional "meaning space". Similar concepts end up close together, even when they share no words.
The process: text is tokenised → each token is encoded into a vector by a neural network → the vector captures both meaning and position.
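As a rough sketch of that pipeline (assuming the sentence-transformers library and the all-MiniLM-L6-v2 model mentioned later in this guide):

```python
# Sketch: text in, vectors out. Assumes `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small model producing 384-dimensional vectors

sentences = ["The cat sat on the mat.", "A feline rested on the rug."]
vectors = model.encode(sentences)  # numpy array of shape (2, 384)

# The two sentences share no content words, yet their vectors end up close together.
print(vectors.shape)
```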
From Simple Counting to Contextual Understanding
| Method | How It Works | Key Limitation |
|---|---|---|
| One-Hot Encoding (OHE) | One '1' per word in a huge vector | No meaning captured; enormous size |
| TF-IDF | Weights terms by frequency vs. corpus rarity | Fails when synonyms share no keywords |
| Word2Vec | Predicts context words (CBOW / Skip-gram) | Static — one vector per word regardless of context |
| BERT | Bidirectional transformer encoder | Contextual ('bank' gets a different vector in 'river bank' vs 'bank loan'), but computationally heavier and needs pooling to produce sentence vectors |
Measuring Similarity: Cosine vs. Euclidean
- Euclidean distance — the straight-line gap between two points; sensitive to vector magnitude. Best for physical measurements or ratings where scale matters.
- Cosine similarity — measures the angle between vectors, ignoring magnitude. Best for text: robust to differences in document length or feature scale. Use this by default (a quick comparison of the two follows below).
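To make the difference concrete, here is a small numpy sketch with made-up vectors: two vectors pointing the same way have a large Euclidean gap but a perfect cosine score.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

euclidean = np.linalg.norm(a - b)                          # ~3.74: penalises the length difference
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0: only the angle matters

print(euclidean, cosine)
```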
Always L2-normalise (convert vector lengths to 1) before comparing. In high-dimensional space, Euclidean distances clump together — cosine avoids this.
Watch out for anisotropy — when most vectors cluster in a narrow region. Fix: subtract the corpus mean vector, then re-normalise. If still poor, apply PCA whitening.
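A sketch of that recipe in numpy (with scikit-learn for the optional whitening step); the function names here are illustrative, not from any particular library:

```python
import numpy as np
from sklearn.decomposition import PCA

def centre_and_normalise(embeddings: np.ndarray) -> np.ndarray:
    """Subtract the corpus mean, then L2-normalise each row to unit length."""
    centred = embeddings - embeddings.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centred, axis=1, keepdims=True)
    return centred / np.clip(norms, 1e-12, None)

def whiten(embeddings: np.ndarray, n_components: int = 128) -> np.ndarray:
    """Optional second step if similarities are still bunched together: PCA whitening
    rescales each principal component to unit variance. n_components must not exceed
    the number of rows or columns of the input."""
    return PCA(n_components=n_components, whiten=True).fit_transform(embeddings)
```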
From Words to Sentences: Pooling
Comparing every word to every word across millions of documents is too slow. Pooling compresses token-level vectors into a single sentence or document vector:
- Mean pooling — average all word vectors. Simple, robust, and the recommended default for sentence-transformer models (e.g. all-MiniLM-L6-v2); a small sketch follows this list.
- Weighted mean pooling — give high-IDF or high-attention words more weight.
- CLS token — use the special [CLS] vector output from BERT-style models. Fast, but mean pooling often outperforms it empirically.
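A minimal numpy sketch of mean and weighted-mean pooling over token vectors (the shapes and function names are illustrative; in practice the model supplies the token vectors and attention mask):

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions where mask == 0."""
    mask = mask[:, None].astype(float)                     # (num_tokens, 1)
    total = (token_vectors * mask).sum(axis=0)
    return total / np.clip(mask.sum(), 1e-9, None)

def weighted_mean_pool(token_vectors: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted average, e.g. with per-token IDF or attention scores as weights."""
    weights = weights[:, None].astype(float)
    return (token_vectors * weights).sum(axis=0) / np.clip(weights.sum(), 1e-9, None)

# Toy example: 4 tokens, 8-dimensional vectors, last token is padding.
tokens = np.random.rand(4, 8)
sentence_vec = mean_pool(tokens, mask=np.array([1, 1, 1, 0]))
```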
Chunking: Splitting Text Without Breaking Meaning
Long documents must be split before embedding. Too short = noisy and uninformative. Too long = multiple ideas averaged together, losing precision.
| Practice | Recommendation | Why |
|---|---|---|
| Split on | Headings / paragraphs / semantic drift | Respect natural topic boundaries |
| Chunk size | 200–400 tokens (~150–300 words) | Enough info; not too diluted |
| Overlap | 10–15% between consecutive chunks | Preserves context at boundaries |
| Semantic check | cosine < 0.75 → start new chunk | Detects topic drift automatically |
| Normalise | Mean-centre + L2 | Keeps geometry clean |
Avoid "Frankenstein" chunks that mix unrelated topics — the resulting embedding points between meanings and is useless for both. Diagnose with PCA / UMAP scatter plots: blurred edges between clusters signal bad chunking.
Troubleshooting Cheat Sheet
| Symptom | Likely Cause | Fix |
|---|---|---|
| Everything seems similar | Anisotropy / no normalisation | Subtract corpus mean, L2-normalise |
| Unrelated chunks retrieved | Frankenstein chunks | Chunk by headings / semantic drift |
| Same doc appears twice | Overlap too large | Reduce overlap to 10% |
| Cosine scores meaningless | No calibration baseline | Build a small gold evaluation set |
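For the last row, a gold evaluation set can be as small as a handful of hand-labelled query-to-chunk pairs; here is a sketch of scoring retrieval against one (the pairs and names below are hypothetical):

```python
import numpy as np

# Hypothetical gold set: each query is paired with the id of the chunk it should retrieve.
gold = [("How do I reset my password?", "chunk_17"),
        ("What is the refund window?", "chunk_04")]

def top1_accuracy(query_vecs, chunk_vecs, chunk_ids, expected_ids):
    """Fraction of queries whose nearest chunk (by cosine) is the expected one."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    best = (q @ c.T).argmax(axis=1)          # index of the most similar chunk for each query
    hits = [chunk_ids[i] == expected for i, expected in zip(best, expected_ids)]
    return sum(hits) / len(hits)

# Usage: embed the gold queries and all chunks with the same model, then call
# top1_accuracy(query_vecs, chunk_vecs, chunk_ids, [g[1] for g in gold]).
```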
Further Reading
- Harsh Vardhan — A Comprehensive Guide to Word Embeddings in NLP
- Microsoft Ignite — Design and Develop a RAG Solution
- LangChain Docs — Build a RAG Agent
- HuggingFace — LLM Embeddings Explained: A Visual and Intuitive Guide