LLMs and Transformer Models
What They Do and How to Use Them Better
Author: Brendan Beh, AI.SEA · Shared under AI.SEA's learning commons — please share, not sell.
LLMs Are Probability Machines
An LLM doesn't "understand" language the way humans do. Given an input, it produces a probability distribution over all possible next tokens and samples one from it (usually, but not always, the most likely) — repeatedly, until the response is complete.
User: "What is the meaning of life?" LLM: "The Hitchhiker's Guide To The Universe tells us that the answer is 42."
Each word chosen shapes the probability landscape for the next word. The model isn't retrieving a stored answer — it's constructing one token by token based on likelihood.
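A minimal sketch of this generation loop, with a hypothetical `model` function standing in for the network (any callable that maps token IDs to next-token logits would do):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def generate(model, prompt_ids, max_tokens=50, eos_id=0):
    """Greedy decoding: repeatedly pick the most likely next token.
    `model` is a hypothetical function returning next-token logits."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        probs = softmax(model(ids))      # distribution over the whole vocabulary
        next_id = int(np.argmax(probs))  # real systems usually *sample* instead
        ids.append(next_id)
        if next_id == eos_id:            # stop at the end-of-sequence token
            break
    return ids
```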
Context Defines the Probabilities
The same sentence fragment produces completely different probability distributions depending on what came before it.
- "I am heading to the ____" → supermarket, store, town…
- "We are out of eggs. I am heading to the ____" → supermarket jumps to the top
- "I am heading to the ____ [without the eggs context]" → toilet, bedroom, office become equally likely
If your prompt doesn't capture enough context, the model can't narrow the distribution toward the completion you want, so it struggles to form good sentences or extract information reliably.
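You can observe this shift directly. A small sketch, assuming the Hugging Face transformers library and the publicly downloadable GPT-2 model (any causal LM shows the same effect):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_prob(prompt: str, word: str) -> float:
    """Probability mass the model puts on `word` as the next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits for the token after the prompt
    probs = torch.softmax(logits, dim=-1)
    word_id = tok.encode(" " + word)[0]    # leading space is a GPT-2 BPE quirk;
    return probs[word_id].item()           # first sub-token suffices for comparison

# The second probability should come out markedly higher than the first.
print(next_token_prob("I am heading to the", "supermarket"))
print(next_token_prob("We are out of eggs. I am heading to the", "supermarket"))
```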
Why Older Models Struggled: The RNN Problem
Earlier models like RNNs processed inputs sequentially, passing a hidden state from one token to the next. The problem: older information gets diluted as the chain grows longer. By the time the model reaches token 50, the signal from token 1 has faded.
This made long-range dependencies — like resolving a pronoun back to a noun mentioned many sentences ago — very difficult to capture reliably.
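A toy illustration of the dilution (not a real RNN, just the geometric decay baked into the recurrence when the effective recurrent weight is below 1):

```python
# Each step squashes the previous hidden state through the same weight.
# With |w| < 1, token 1's contribution shrinks geometrically per step.
w = 0.9
influence = 1.0
for step in range(50):
    influence *= w
print(f"{influence:.4f}")  # ~0.0052: token 1 is nearly invisible by token 50
```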
How Transformers Solve It: Parallel Attention
Instead of passing information sequentially, transformers let every word talk to every other word simultaneously. Each token asks: "Which other tokens in this input are relevant to understanding me?"
Because this happens in parallel across the entire input at once, the model can capture relationships between tokens that are far apart — something RNNs fundamentally struggled with.
The Embedding Process
Before any of this can happen, words need to be converted into numbers. This is the embedding process:
- Input is broken into tokens (roughly word-sized pieces)
- Each token is looked up in an embedding matrix — a giant table where each row is a vector representing that token
- A positional encoding is added so the model knows where in the sequence each token sits
- The result is a vector that encodes both the meaning of the token and its position
The embedding matrix dimensions:
- Rows = the vocabulary size (one row per token)
- Columns = the number of dimensions in "meaning space"
Words with similar meanings end up in similar coordinates after training — this is an emergent property, not something explicitly programmed in.
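A sketch of the lookup-and-add step in NumPy (the sizes and token IDs are made up; the sinusoidal encoding follows the original transformer paper):

```python
import numpy as np

vocab_size, d_model, seq_len = 50_000, 512, 8  # illustrative sizes

# Embedding matrix: one row per vocabulary token, d_model columns of "meaning space".
embedding = np.random.randn(vocab_size, d_model) * 0.02

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original transformer paper."""
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1)
    i = np.arange(d_model)[None, :]     # (1, d_model)
    angle = pos / np.power(10_000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

token_ids = np.array([17, 2003, 1996, 3007, 42, 7, 99, 5])        # hypothetical IDs
x = embedding[token_ids] + positional_encoding(seq_len, d_model)  # meaning + position
```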
💡 Practical tip: Avoid emoji and rare unicode in critical spans. Normalize input where possible. Prefer concise wording and short identifiers — "customer-complaint-resolution-protocol-v7" → "CCR-v7". LLMs tokenize sub-word, so unusual strings can fragment unexpectedly (the famous "strawberry" R-counting failure is a tokenization issue, not a reasoning one).
The Attention Mechanism: Query, Key, Value
The core of the transformer is the QKV attention mechanism. For every token, the model computes three things:
- Query (Q) — "What am I looking for?" e.g. "Is there an adjective before this word?"
- Key (K) — "What do I offer?" e.g. "I am an adjective, and here's where I am"
- Value (V) — "What information should I contribute?" e.g. "Here is the actual adjective"
The attention score between two tokens is computed as a normalized dot product of their Q and K vectors. High overlap = high attention = the Value from that token gets added into the current token's representation.
Formally:
Attention(Q,K,V) = softmax( QKᵀ / √d_k ) · V
Each set of Q/K/V projections forms one attention head. LLMs run many heads in parallel (multi-head attention) to capture different types of relationships simultaneously — grammar, coreference, topic, tone, etc.
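A minimal NumPy sketch of the formula above (single head, random toy matrices; in a real model the Q/K/V projections are learned):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq): how strongly each token attends to each
    return softmax(scores) @ V       # rows sum to 1; output is a weighted mix of Values

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out = attention(Q, K, V)  # a new representation for each of the 4 tokens
```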
💡 Practical tip: QKV matching loves structure. Use strong delimiters (###, triple backticks, YAML blocks) to isolate sections. Wrap context with ### CONTEXT / ### END CONTEXT. Put few-shot examples immediately before the query and mirror the output schema exactly. Use explicit labels (Instruction:, Examples:, Your turn:) to produce distinctive keys.
System: You are a tax summarizer. Output JSON schema below.
### SCHEMA
{ "income": number, "deductions": number, "summary": string }
### EXAMPLES
Input: ...
Output: {"income":..., "deductions":..., "summary":"..."}
### YOUR TURN
Input: ...
Output:
Masking: Keeping Information Flowing One Way
During training, we don't want future tokens to influence the prediction of earlier ones — that would be cheating. The fix is causal masking: the attention scores for future tokens are set to negative infinity before the softmax, so their attention weights collapse to zero and attention can only flow from earlier tokens to later ones.
This is why LLMs generate left-to-right — they're architecturally prevented from looking ahead.
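Extending the NumPy sketch above with a causal mask (again a toy, single-head version):

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)    # stability; -inf stays -inf
    weights = np.exp(scores)                        # exp(-inf) = 0
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax; future weights are 0
    return weights @ V
```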
The Lost-in-the-Middle Phenomenon
In theory, attention is position-independent — every token can attend to every other. In practice, LLMs systematically prioritise tokens at the head and tail of the prompt, with accuracy dropping for information buried in the middle.
Why this happens:
- Causal masking inherently biases attention toward earlier positions in deep networks
- Deeper layers have increasingly contextualised views of early tokens, amplifying their influence regardless of semantic content
- LLMs are trained on human-written text, which reflects the primacy-recency effect — humans naturally remember the beginning and end of sequences best
The practical consequence: if you have 20 retrieved documents and the answer is in document 10, performance drops significantly compared to it being in document 1 or 20.
Attention Scales as O(N²) — Keep Prompts Lean
The attention matrix grows with the square of the input length. Double your context window, quadruple the compute. More importantly, attention effectiveness degrades as input length increases — the signal gets diluted.
Instead of writing long system prompts:
- Pre-segment long docs (titles + abstracts + chunks)
- Replace long boilerplate with variables ({policy}, {schema})
- Use headlines, bullet points, and IDs instead of verbose prose
- For file intake, chunk with small overlaps (200–400 tokens, 10–15% overlap) and index via embeddings (see the sketch after this list)
- Prefer tool calls (deterministic APIs) over verbose "reasoning" in-context when possible
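A minimal sketch of the overlap rule (the function name and sizes are illustrative):

```python
def chunk_tokens(tokens, chunk_size=300, overlap=40):
    """Split a token list into overlapping chunks for embedding-based indexing.

    chunk_size of 200-400 with ~10-15% overlap, per the guidance above; the
    overlap keeps sentences that straddle a boundary retrievable from both sides.
    """
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

chunks = chunk_tokens(list(range(1000)))
print(len(chunks), len(chunks[0]))  # 4 chunks of up to 300 tokens (the last is shorter)
```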
How to Arrange Your Prompt
Given the lost-in-the-middle effect, prompt structure matters as much as prompt content. The recommended spine:
| Position | What goes here |
|---|---|
| Top | Role definition + hard rules |
| Middle | Minimal context (3–5 bullets max) |
| Middle | Task statement |
| Middle | Few-shot examples |
| Bottom | Restate the single most critical constraint |
Example:
Top: "You are a strict validator. Output JSON only."
[context, 3–5 bullets]
[task]
[examples]
Bottom: "Reminder: JSON only, no prose."
- Front-load critical instructions
- End-load key reminders
- Never bury requirements in the middle
- One sentence per rule; use compact bullets and numbered steps
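The same spine as a reusable template (a hypothetical helper; adapt the section labels to your own conventions):

```python
def build_prompt(role_and_rules, context_bullets, task, examples, reminder):
    """Assemble a prompt following the spine above (labels are illustrative)."""
    return "\n\n".join([
        role_and_rules,                                    # top: role + hard rules
        "### CONTEXT\n"
        + "\n".join(f"- {b}" for b in context_bullets)
        + "\n### END CONTEXT",                             # middle: 3-5 bullets max
        "### TASK\n" + task,
        "### EXAMPLES\n" + examples,
        "Reminder: " + reminder,                           # bottom: restate the constraint
    ])

print(build_prompt(
    "You are a strict validator. Output JSON only.",
    ["Input arrives as raw form text", "Fields may be missing"],
    "Validate the input below and emit the JSON verdict.",
    'Input: ... -> Output: {"valid": true}',
    "JSON only, no prose.",
))
```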
Summary: Practical Checklist
Condense and sanitise your prompts:
- Pre-segment long docs (titles + abstracts + chunks)
- Replace boilerplate with variables ({policy}, {schema})
- Use headlines, bullets, and IDs — not verbose prose
- Chunk file intake at 200–400 tokens with 10–15% overlap, index via embeddings
- Prefer deterministic tool calls over in-context reasoning where possible
- Remove decorative emojis; abbreviate identifiers
Structure prompts to avoid lost-in-the-middle:
- Front-load critical instructions; end-load key reminders
- Keep prompts concise — never bury requirements in the middle
- Repeat the single most critical constraint at the end
- Use strong delimiters (###, triple backticks, YAML blocks) to isolate sections