LLMs and Transformer Models
What They Do and How to Use Them Better
Author: Brendan Beh, AI.SEA · Shared under AI.SEA's learning commons — please share, not sell.
LLMs Are Probability Machines
An LLM doesn't "understand" language the way humans do. Given an input, it produces a probability distribution over all possible next tokens and samples one from it (usually, but not always, the most likely) — repeatedly, until the response is complete.
User: "What is the meaning of life?" LLM: "The Hitchhiker's Guide To The Universe tells us that the answer is 42."
Each word chosen shapes the probability landscape for the next word. The model isn't retrieving a stored answer — it's constructing one token by token based on likelihood.
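A minimal sketch of this generation loop, with a hypothetical `model` function standing in for the network (any callable that maps token IDs to next-token logits would do):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def generate(model, prompt_ids, max_tokens=50, eos_id=0):
    """Greedy decoding: repeatedly pick the most likely next token.
    `model` is a hypothetical function returning next-token logits."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        probs = softmax(model(ids))      # distribution over the whole vocabulary
        next_id = int(np.argmax(probs))  # real systems usually *sample* instead
        ids.append(next_id)
        if next_id == eos_id:            # stop at the end-of-sequence token
            break
    return ids
```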
Context Defines the Probabilities
The same sentence fragment produces completely different probability distributions depending on what came before it.
- "I am heading to the ____" → supermarket, store, town…
- "We are out of eggs. I am heading to the ____" → supermarket jumps to the top
- "I am heading to the ____ [without the eggs context]" → toilet, bedroom, office become equally likely
If your prompt doesn't capture enough context, the model can't narrow the distribution toward the completion you want, so it struggles to form good sentences or extract information reliably.
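You can observe this shift directly. A small sketch, assuming the Hugging Face transformers library and the publicly downloadable GPT-2 model (any causal LM shows the same effect):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_prob(prompt: str, word: str) -> float:
    """Probability mass the model puts on `word` as the next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the token after the prompt
    probs = torch.softmax(logits, dim=-1)
    word_id = tok.encode(" " + word)[0]    # leading space is a GPT-2 BPE quirk;
    return probs[word_id].item()           # first sub-token suffices for comparison

# The second probability should come out markedly higher than the first.
print(next_token_prob("I am heading to the", "supermarket"))
print(next_token_prob("We are out of eggs. I am heading to the", "supermarket"))
```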
Why Older Models Struggled: The RNN Problem
Earlier models like RNNs processed inputs sequentially, passing a hidden state from one token to the next. The problem: older information gets diluted as the chain grows longer. By the time the model reaches token 50, the signal from token 1 has faded.
This made long-range dependencies — like resolving a pronoun back to a noun mentioned many sentences ago — very difficult to capture reliably.
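A toy illustration of the dilution (not a real RNN, just the geometric decay baked into the recurrence when the effective recurrent weight is below 1):

```python
# Each step squashes the previous hidden state through the same weight.
# With |w| < 1, token 1's contribution shrinks geometrically per step.
w = 0.9
influence = 1.0
for step in range(50):
    influence *= w
print(f"{influence:.4f}")  # ~0.0052: token 1 is nearly invisible by token 50
```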
How Transformers Solve It: Parallel Attention
Instead of passing information sequentially, transformers let every word talk to every other word simultaneously. Each token asks: "Which other tokens in this input are relevant to understanding me?"
Because this happens in parallel across the entire input at once, the model can capture relationships between tokens that are far apart — something RNNs fundamentally struggled with.
The Embedding Process
Before any of this can happen, words need to be converted into numbers. This is the embedding process:
- Input is broken into tokens (roughly word-sized pieces)
- Each token is looked up in an embedding matrix — a giant table where each row is a vector representing that token
- A positional encoding is added so the model knows where in the sequence each token sits
- The result is a vector that encodes both the meaning of the token and its position
The embedding matrix dimensions:
- Rows = the vocabulary size (one row per token)
- Columns = the number of dimensions in "meaning space"
Words with similar meanings end up in similar coordinates after training — this is an emergent property, not something explicitly programmed in.
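A sketch of the lookup-and-add step in NumPy (the sizes and token IDs are made up; the sinusoidal encoding follows the original transformer paper):

```python
import numpy as np

vocab_size, d_model, seq_len = 50_000, 512, 8  # illustrative sizes

# Embedding matrix: one row per vocabulary token, d_model columns of "meaning space".
embedding = np.random.randn(vocab_size, d_model) * 0.02

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original transformer paper."""
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1)
    i = np.arange(d_model)[None, :]     # (1, d_model)
    angle = pos / np.power(10_000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

token_ids = np.array([17, 2003, 1996, 3007, 42, 7, 99, 5])        # hypothetical IDs
x = embedding[token_ids] + positional_encoding(seq_len, d_model)  # meaning + position
```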
💡 Practical tip: Avoid emoji and rare unicode in critical spans. Normalize input where possible. Prefer concise wording and short identifiers — "customer-complaint-resolution-protocol-v7" → "CCR-v7". LLMs tokenize sub-word, so unusual strings can fragment unexpectedly (the famous "strawberry" R-counting failure is a tokenization issue, not a reasoning one).
The Attention Mechanism: Query, Key, Value
The core of the transformer is the QKV attention mechanism. For every token, the model computes three things:
- Query (Q) — "What am I looking for?" e.g. "Is there an adjective before this word?"
- Key (K) — "What do I offer?" e.g. "I am an adjective, and here's where I am"
- Value (V) — "What information should I contribute?" e.g. "Here is the actual adjective"
The attention score between two tokens is computed as a normalized dot product of their Q and K vectors. High overlap = high attention = the Value from that token gets added into the current token's representation.
Formally:
Attention(Q,K,V) = softmax( QKᵀ / √d_k ) · V
Each set of Q/K/V projections forms one attention head. LLMs run many heads in parallel (multi-head attention) to capture different types of relationships simultaneously — grammar, coreference, topic, tone, etc.
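A minimal NumPy sketch of the formula above (single head, random toy matrices; in a real model the Q/K/V projections are learned):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq): how strongly each token attends to each
    return softmax(scores) @ V       # rows sum to 1; output is a weighted mix of Values

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out = attention(Q, K, V)  # a new representation for each of the 4 tokens
```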
💡 Practical tip: QKV matching loves structure. Use strong delimiters (###, triple backticks, YAML blocks) to isolate sections. Wrap context with ### CONTEXT / ### END CONTEXT. Put few-shot examples immediately before the query and mirror the output schema exactly. Use explicit labels (Instruction:, Examples:, Your turn:) to produce distinctive keys.
System: You are a tax summarizer. Output JSON schema below.
### SCHEMA
{ "income": number, "deductions": number, "summary": string }
### EXAMPLES
Input: ...
Output: {"income":..., "deductions":..., "summary":"..."}
### YOUR TURN
Input: ...
Output:
Masking: Keeping Information Flowing One Way
During training, we don't want future tokens to influence the prediction of earlier ones — that would be cheating. The fix is causal masking: the attention scores for future tokens are set to negative infinity before the softmax, so their attention weights collapse to zero and attention can only flow from earlier tokens to later ones.
This is why LLMs generate left-to-right — they're architecturally prevented from looking ahead.
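Extending the NumPy sketch above with a causal mask (again a toy, single-head version):

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)    # stability; -inf stays -inf
    weights = np.exp(scores)                        # exp(-inf) = 0
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax; future weights are 0
    return weights @ V
```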
The Lost-in-the-Middle Phenomenon
In theory, attention is position-independent — every token can attend to every other. In practice, LLMs systematically prioritise tokens at the head and tail of the prompt, with accuracy dropping for information buried in the middle.
Why this happens:
- Causal masking inherently biases attention toward earlier positions in deep networks
- Deeper layers have increasingly contextualised views of early tokens, amplifying their influence regardless of semantic content
- LLMs are trained on human-written text, which reflects the primacy-recency effect — humans naturally remember the beginning and end of sequences best
The practical consequence: if you have 20 retrieved documents and the answer is in document 10, performance drops significantly compared to it being in document 1 or 20.
Attention Scales as O(N²) — Keep Prompts Lean
The attention matrix grows with the square of the input length. Double your context window, quadruple the compute. More importantly, attention effectiveness degrades as input length increases — the signal gets diluted.
Instead of writing long system prompts:
- Pre-segment long docs (titles + abstracts + chunks)
- Replace long boilerplate with variables ({policy}, {schema})
- Use headlines, bullet points, and IDs instead of verbose prose
- For file intake, chunk with small overlaps (200–400 tokens, 10–15% overlap) and index via embeddings (see the sketch after this list)
- Prefer tool calls (deterministic APIs) over verbose "reasoning" in-context when possible
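A minimal sketch of the overlap rule (the function name and sizes are illustrative):

```python
def chunk_tokens(tokens, chunk_size=300, overlap=40):
    """Split a token list into overlapping chunks for embedding-based indexing.

    chunk_size of 200-400 with ~10-15% overlap, per the guidance above; the
    overlap keeps sentences that straddle a boundary retrievable from both sides.
    """
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

chunks = chunk_tokens(list(range(1000)))
print(len(chunks), len(chunks[0]))  # 4 chunks of up to 300 tokens (the last is shorter)
```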
How to Arrange Your Prompt
Given the lost-in-the-middle effect, prompt structure matters as much as prompt content. The recommended spine:
| Position | What goes here |
|---|---|
| Top | Role definition + hard rules |
| Middle | Minimal context (3–5 bullets max) |
| Middle | Task statement |
| Middle | Few-shot examples |
| Bottom | Restate the single most critical constraint |
Example:
Top: "You are a strict validator. Output JSON only."
[context, 3–5 bullets]
[task]
[examples]
Bottom: "Reminder: JSON only, no prose."
- Front-load critical instructions
- End-load key reminders
- Never bury requirements in the middle
- One sentence per rule; use compact bullets and numbered steps
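The same spine as a reusable template (a hypothetical helper; adapt the section labels to your own conventions):

```python
def build_prompt(role_and_rules, context_bullets, task, examples, reminder):
    """Assemble a prompt following the spine above (labels are illustrative)."""
    return "\n\n".join([
        role_and_rules,                                    # top: role + hard rules
        "### CONTEXT\n"
        + "\n".join(f"- {b}" for b in context_bullets)
        + "\n### END CONTEXT",                             # middle: 3-5 bullets max
        "### TASK\n" + task,
        "### EXAMPLES\n" + examples,
        "Reminder: " + reminder,                           # bottom: restate the constraint
    ])

print(build_prompt(
    "You are a strict validator. Output JSON only.",
    ["Input arrives as raw form text", "Fields may be missing"],
    "Validate the input below and emit the JSON verdict.",
    'Input: ... -> Output: {"valid": true}',
    "JSON only, no prose.",
))
```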
Summary: Practical Checklist
Condense and sanitise your prompts:
- Pre-segment long docs (titles + abstracts + chunks)
- Replace boilerplate with variables ({policy}, {schema})
- Use headlines, bullets, and IDs — not verbose prose
- Chunk file intake at 200–400 tokens with 10–15% overlap, index via embeddings
- Prefer deterministic tool calls over in-context reasoning where possible
- Remove decorative emojis; abbreviate identifiers
Structure prompts to avoid lost-in-the-middle:
- Front-load critical instructions; end-load key reminders
- Keep prompts concise — never bury requirements in the middle
- Repeat the single most critical constraint at the end
- Use strong delimiters (###, triple backticks, YAML blocks) to isolate sections