LoRA — Low-Rank Adaptation
A brief guide to LoRA
LoRA is a parameter-efficient fine-tuning technique that lets you adapt a pretrained language model to a specific task or behaviour without updating all of its weights. Instead of modifying the full model, LoRA trains two small matrices alongside the frozen base and adds their product as a correction to the original weights, both during training and at inference time.
It is currently the dominant approach for practical fine-tuning of open models. Most serious fine-tuning work you'll encounter in the wild uses LoRA or its memory-efficient variant, QLoRA.
Why LoRA exists
Full fine-tuning updates every parameter in the model. For a 7B model at 16-bit precision, that means holding the weights (~14GB), gradients (~14GB), and optimiser states (~28GB, two Adam moment estimates per parameter) in GPU memory at once: around 56GB before activations. Most consumer GPUs top out at 24GB, and most people don't have access to A100-class hardware.
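The arithmetic behind the ~56GB figure can be sketched in a few lines. This is a back-of-envelope estimate, assuming everything is stored at 16-bit and an Adam-style optimiser that keeps two states per parameter; the function name and its defaults are illustrative, not taken from any library.

```python
def full_finetune_memory_gb(n_params_billion, bytes_per_param=2, optimizer_states=2):
    """Rough GPU memory budget (in GB) for full fine-tuning, before activations.

    Assumes weights, gradients and optimiser states all use `bytes_per_param`
    bytes, and that the optimiser keeps `optimizer_states` values per parameter.
    """
    weights = n_params_billion * bytes_per_param
    gradients = n_params_billion * bytes_per_param
    optimizer = n_params_billion * bytes_per_param * optimizer_states
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


budget = full_finetune_memory_gb(7)
for name, gb in budget.items():
    print(f"{name:>10}: ~{gb}GB")
```

Real runs often need more than this (activations, higher-precision optimiser states), so treat the total as a floor rather than a prediction.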
The insight behind LoRA, introduced by Hu et al. (2021), is that the weight updates produced during fine-tuning have a low intrinsic rank. Even though the weight matrices are large — a 7B model's attention projections are typically 4096 × 4096 — the meaningful change needed to adapt the model to a new task lives in a much smaller subspace. You don't need to update all 16.7 million parameters in a single matrix. You need to update the directions that matter.
LoRA directly parameterises that subspace.
The mechanics
LoRA freezes the original weight matrix W and introduces two trainable matrices A and B. The weight update ΔW is expressed as their product:
W' = W + ΔW

ΔW = B · A

where:
- W ∈ ℝᵐˣⁿ: original frozen weight matrix
- A ∈ ℝʳˣⁿ: small trainable matrix (random initialisation)
- B ∈ ℝᵐˣʳ: small trainable matrix (zero initialisation)
- r: rank (chosen by the practitioner: 4, 8, 16, 64...)
Because B is initialised to zero, ΔW = 0 at the start of training. The model begins behaving identically to the base model, and training moves away from that stable starting point gradually. A is initialised with small random values so that B receives a non-zero gradient from the first step.
During training, only A and B receive gradient updates. W stays frozen. At inference time, B·A is added to W — or merged into it permanently for deployment with no additional latency cost.
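The mechanics above can be checked on a toy example. The sketch below uses plain Python lists rather than a tensor library, tiny made-up dimensions, and random values as a stand-in for a trained B; the helper names (`matvec`, `matmul`, `lora_forward`, `merge`) are hypothetical. It shows that zero-initialised B leaves the output unchanged, and that merging B·A into W gives the same output as keeping the adapter separate.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_forward(W, A, B, x):
    """Compute W·x + B·(A·x) without ever forming the full m x n update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


def merge(W, A, B):
    """Fold the adapter into the base weights: W' = W + B·A."""
    delta = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta)]


random.seed(0)
m, n, r = 6, 5, 2  # toy sizes; real projections are e.g. 4096 x 4096

W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(n)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(m)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(n)]

# At initialisation the adapter contributes nothing.
y_base = matvec(W, x)
y_init = lora_forward(W, A, B, x)
init_gap = max(abs(a - b) for a, b in zip(y_base, y_init))

# Stand-in for training: pretend gradient updates moved B away from zero.
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(m)]

# Adapter kept separate vs merged into W: same output up to rounding.
y_adapter = lora_forward(W, A, B, x)
y_merged = matvec(merge(W, A, B), x)
merge_gap = max(abs(a - b) for a, b in zip(y_adapter, y_merged))

print("gap at init:", init_gap)
print("gap after merge:", merge_gap)
```

Keeping the adapter separate costs two small extra matrix products per layer; merging removes them, which is why a merged model has no added latency.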
The memory saving is substantial. With r = 8 and m = n = 4096:
- Full weight matrix: 4096 × 4096 = 16.7 million parameters
- LoRA adapters: (8 × 4096) + (4096 × 8) = 65,536 parameters
- That is 0.4% of the original
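The same count for other ranks is a one-line formula. A minimal sketch, with `lora_param_count` as an illustrative helper name:

```python
def lora_param_count(m, n, r):
    """Trainable parameters for one adapted matrix: A is r x n, B is m x r."""
    return r * n + m * r


m = n = 4096
full = m * n

for r in (4, 8, 16, 64):
    lora = lora_param_count(m, n, r)
    print(f"r={r:>2}: {lora:>7,} adapter params, {100 * lora / full:.2f}% of {full:,}")
```

Adapter size grows linearly with r, so even r = 64 stays at about 3% of the full matrix for this shape.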
Because only the adapter weights are trained, gradients and optimiser states shrink by the same factor. A 7B fine-tuning job that would require ~56GB in full fine-tuning mode drops to roughly 16–18GB with LoRA — within reach of a consumer RTX 4090.
The three knobs
When configuring a LoRA run, three parameters matter most.
Rank (r)
The bottleneck dimension — how many independent directions the update is allowed to express. Higher rank means more expressive adapters, more parameters, more memory.
| Value | Character |
|---|---|
| r = 4 | Minimal. Fast. Good for simple format/style tasks. |
| r = 8 | Common default. Works for most tasks. |
| r = 16–32 | More capacity. Use when r=8 underfits. |
| r = 64–128 | High capacity. Approaching full fine-tuning territory. |
Start at r = 8. Increase only if training loss plateaus and you have evidence the model isn't expressive enough.
How to choose rank in practice
The honest answer is that rank selection is empirical — you don't know the right value before training, you discover it by watching what happens. But there are intuitions that get you to a reasonable starting point without blind guessing.
The core intuition
Rank is a measure of how different the behaviour you want is from what the base model already does. Simple task, small distribution shift, low rank. Complex task, large distribution shift, higher rank.
Think about it this way: if you're teaching a model to always respond in JSON format, that's a narrow behavioural change — the model already knows JSON, you're just making it the default output mode. r = 4 or r = 8 is probably enough. If you're teaching a model to reason like a domain expert in a specialised field using terminology and conventions it's only seen rarely, you're asking for a much larger shift. r = 32 or r = 64 gives more capacity for that.
A rough heuristic by task type
| Task | Intuition | Starting rank |
|---|---|---|
| Output format / structure (JSON, markdown, diffs) | Narrow, surface-level change | r = 4–8 |
| Tone and style consistency | Moderate — behavioural, not factual | r = 8–16 |
| Domain-specific response patterns | Larger shift, more nuance | r = 16–32 |
| Highly specialised behaviour, complex reasoning style | Large shift | r = 32–64 |
| Approaching full fine-tuning territory | Ask whether LoRA is the right tool | r = 128+ |
What the training loss tells you
The practical signal for rank is whether your training loss converges cleanly. If loss stops decreasing early and plateaus high, your adapters may not have enough capacity — try doubling r. If loss drops fast and then starts climbing again (overfitting), your rank may be too high for the dataset size — reduce r or add more data.
A training loss curve that descends smoothly and levels off at a low value, with validation loss following it down and flattening rather than climbing, is the target shape. You're looking for that curve, not a specific loss number.
Dataset size and rank are linked
More rank means more trainable parameters, which means the model needs more data to learn them reliably. A rough rule of thumb: don't push rank above r = 16 with fewer than a few hundred examples, and don't go above r = 64 without at least several thousand high-quality examples. A small dataset with high rank will overfit — the adapters will memorise your examples rather than generalise from them.
The practical starting protocol
- Start at r = 8, α = 16, train for a few epochs
- Look at the loss curve — does it converge?
- Run inference on held-out examples — does the behaviour look right?
- If the model underfits (loss won't go low, behaviour is weak), double r and retrain
- If the model overfits (train loss low, validation loss climbing, behaviour is brittle), reduce r or get more data
- If it works, ship it — don't keep increasing rank looking for marginal gains
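The decision logic in this protocol can be written down as a small function. This is only a sketch: the function name, the plateau tolerance, the target training loss, and the rank bounds are all assumptions you would set for your own task, and the qualitative check on held-out behaviour still has to be done by hand.

```python
def suggest_next_rank(r, train_losses, val_losses,
                      target_train_loss=1.0, tol=0.01, min_r=4, max_r=128):
    """Return (diagnosis, suggested_rank) from per-epoch loss histories.

    All thresholds are illustrative assumptions, not recommended values.
    Expects at least two epochs of losses.
    """
    best_val = min(val_losses)
    val_rising = val_losses[-1] > best_val * (1 + tol)
    train_falling = train_losses[-1] < train_losses[0]

    # Overfitting: training loss keeps improving while validation loss
    # has turned upward from its best value.
    if val_rising and train_falling:
        return "overfit: reduce r or get more data", max(min_r, r // 2)

    # Underfitting: training loss has stopped moving but is still high.
    plateaued = abs(train_losses[-2] - train_losses[-1]) < tol
    if plateaued and train_losses[-1] > target_train_loss:
        return "underfit: double r and retrain", min(max_r, r * 2)

    return "converging: keep r and check held-out behaviour", r


print(suggest_next_rank(8, [2.0, 1.8, 1.795], [2.1, 1.9, 1.9]))
print(suggest_next_rank(8, [2.0, 0.8, 0.3], [2.0, 1.2, 1.6]))
print(suggest_next_rank(8, [2.0, 0.9, 0.6], [2.0, 1.1, 1.0]))
```

The function only ever halves or doubles r, which matches the coarse search the protocol describes; there is little point tuning rank more finely than that.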
The temptation is to treat higher rank as always better. It isn't. A well-trained r = 8 adapter on a clean dataset will outperform a poorly trained r = 64 adapter on a noisy one. Rank gives capacity. Data gives signal. You need both.
Alpha (α)
A scaling factor applied to the update. The effective weight change is (α/r) · B · A. Alpha controls how strongly the adapters influence the output relative to the frozen base.
In practice: set α = r (scale = 1.0) or α = 2r (scale = 2.0) and leave it. Don't optimise alpha until you've exhausted other variables.
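One consequence of the α/r scaling is easy to miss: if you raise r but leave α fixed, the effective scale shrinks. A short illustration, with `lora_scale` as a hypothetical helper:

```python
def lora_scale(alpha, r):
    """Multiplier applied to B·A before it is added to the frozen weights."""
    return alpha / r


for r, alpha in [(8, 8), (8, 16), (16, 16), (64, 16), (64, 128)]:
    print(f"r={r:>2}, alpha={alpha:>3} -> scale {lora_scale(alpha, r)}")
```

This is why the advice is phrased as a ratio (α = r or α = 2r): when you double r during a rank search, adjust α with it so the scale stays where it was.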
target_modules
Which weight matrices inside the model receive LoRA adapters. You don't apply LoRA to every layer — typically just the attention projections.
Common targets:
- q_proj: query projection
- k_proj: key projection
- v_proj: value projection
- o_proj: output projection
Some practitioners also target the feed-forward layers (gate_proj, up_proj, down_proj). This increases the number of trainable parameters and can improve results on tasks requiring more factual adaptation, at the cost of higher memory use.
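To see what that cost looks like, the sketch below counts trainable adapter parameters for the two target sets. The layer shapes are an assumption: a Llama-2-7B-like model with hidden size 4096, feed-forward size 11008, 32 layers, and no grouped-query attention. Models that use grouped-query attention have smaller k_proj and v_proj matrices, so their counts will be lower.

```python
# Assumed shapes for a Llama-2-7B-like model (illustrative, not read from a checkpoint).
HIDDEN = 4096
INTERMEDIATE = 11008
N_LAYERS = 32

# name -> (out_features, in_features)
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}


def trainable_params(target_modules, r):
    """Total LoRA parameters across all layers: r * (out + in) per adapted matrix."""
    per_layer = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_layer * N_LAYERS


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
all_linear = attention_only + ["gate_proj", "up_proj", "down_proj"]

print(f"attention only: {trainable_params(attention_only, r=8):,}")
print(f"attention + feed-forward: {trainable_params(all_linear, r=8):,}")
```

Under these assumptions, adding the feed-forward layers more than doubles the adapter size at the same rank, which is worth weighing against simply raising r on the attention projections.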
QLoRA — LoRA on consumer hardware
QLoRA (Dettmers et al., 2023) extends LoRA with 4-bit quantisation of the base model weights, enabling fine-tuning of large models on hardware that would otherwise be completely infeasible.
How it works
The base model weights are compressed from 16-bit floats to a 4-bit data type, NF4 (4-bit NormalFloat), a quantisation format designed for the roughly normal distribution that pretrained weights follow. A 7B model at 16-bit takes ~14GB. At 4-bit, it takes ~3.5GB.
The LoRA adapters themselves remain in full 16-bit precision. Only the frozen base is compressed.
During the forward pass, weights are dequantised on the fly from 4-bit back to 16-bit for computation, then discarded. The stored copy stays compressed. This adds a small computational overhead but keeps the memory footprint low throughout training.
Double quantisation further compresses the quantisation scaling factors themselves from 32-bit to 8-bit, saving about 0.37 bits per parameter, which is roughly 0.3GB on a 7B model.
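These storage figures can be reproduced with a few lines of arithmetic. The block sizes below (64 weights per scaling factor, 256 scaling factors per second-level block) are the defaults as reported in the QLoRA paper; treat them as assumptions here, since a given library configuration may differ.

```python
def weights_gb(n_params, bits_per_param):
    """Storage in GB (1 GB = 1e9 bytes) for n_params values at the given bit width."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9
BLOCK = 64     # weights sharing one scaling factor (assumed default)
BLOCK2 = 256   # scaling factors sharing one second-level factor (assumed default)

fp16_gb = weights_gb(N_PARAMS, 16)
nf4_gb = weights_gb(N_PARAMS, 4)

# Overhead of the scaling factors, expressed in bits per model parameter.
plain_overhead = 32 / BLOCK                           # one 32-bit factor per block
double_overhead = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # 8-bit factors + second level
saved_bits = plain_overhead - double_overhead
saved_gb = weights_gb(N_PARAMS, saved_bits)

print(f"16-bit weights: {fp16_gb:.1f}GB")
print(f"4-bit weights:  {nf4_gb:.1f}GB")
print(f"double quantisation saves {saved_bits:.3f} bits/param, about {saved_gb:.2f}GB")
```

The saving from double quantisation is small on a 7B model but scales with parameter count, so it matters more as the base model grows.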
Paged optimisers handle memory spikes during training by spilling optimiser states to CPU RAM temporarily when GPU memory is exhausted, then paging them back as needed.
The precision tradeoff
4-bit quantisation introduces a small rounding error — each weight is mapped to the nearest of 16 NF4 bucket values rather than stored exactly. In practice this is negligible for most fine-tuning tasks:
- The base weights are frozen, so you're not training through the noise — just reading through it
- The LoRA adapters are trained in full precision and learn to compensate
- Empirically, QLoRA matches full 16-bit LoRA performance on most benchmarks (Dettmers et al., 2023)
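The rounding step itself is simple to illustrate. The sketch below quantises one block of weights to 16 levels using absmax scaling. Note the assumption: it uses 16 evenly spaced levels as a stand-in codebook, not the real NF4 values (which are spaced to match a normal distribution), and the function names are hypothetical.

```python
def quantise_block(weights, levels):
    """Scale a block by its absolute maximum, then snap each value to the nearest level.

    Returns (indices, scale); each index fits in 4 bits when there are 16 levels.
    """
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in weights
    ]
    return indices, scale


def dequantise_block(indices, scale, levels):
    """Recover approximate weights for use in the forward pass."""
    return [levels[i] * scale for i in indices]


# Stand-in codebook: 16 evenly spaced values in [-1, 1]. NOT the real NF4 levels.
LEVELS = [-1 + 2 * i / 15 for i in range(16)]

block = [0.12, -0.50, 0.03, 0.31, -0.07, 0.44, -0.29, 0.18]
indices, scale = quantise_block(block, LEVELS)
restored = dequantise_block(indices, scale, LEVELS)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("4-bit indices:", indices)
print("max rounding error:", round(max_error, 4))
```

With evenly spaced levels the error is bounded by half the level spacing times the block's scale. NF4 places more levels near zero, where most weights sit, which reduces the typical error further.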
When to use which
| Situation | Recommendation |
|---|---|
| Single consumer GPU (RTX 3090, 4090, Colab T4) | QLoRA — the only viable option |
| A100 40GB, fine-tuning ≤13B model | LoRA in 16-bit |
| A100 80GB or multi-GPU | LoRA in 16-bit, or full fine-tuning if dataset warrants |
| First experiment, hardware unclear | Default to QLoRA |
For a first fine-tuning experiment the quality difference between LoRA and QLoRA is not meaningful. Default to QLoRA and move on.
What LoRA cannot do
Understanding the limits is as important as understanding the technique.
It does not reliably inject new factual knowledge. LoRA mainly reshapes how the model behaves, not what it knows. Fine-tuning on internal documents won't reliably make the model recall those documents at inference time. For factual grounding, use RAG. LoRA is for style, format, and behaviour.
It can cause catastrophic forgetting. If the training dataset is too narrow or training runs too long, the model can overfit and lose general capability. It becomes very good at the specific task and noticeably worse at everything else. Low rank and early stopping are the main defences.
Rank is a ceiling, not a guarantee. Setting r = 64 doesn't mean your update will use all 64 directions. If the dataset lacks sufficient variety, the adapters learn fewer effective directions regardless of the capacity you've allocated. More rank does not compensate for a small or inconsistent dataset.
It doesn't fix bad data. LoRA amplifies whatever signal is in the training examples — including noise, inconsistency, and errors. A fine-tuned model is only as good as the data it was trained on. Data preparation is the hardest and most consequential part of the process.
LoRA in the tool stack
LoRA is a technique, not a tool. It's implemented across several layers of the fine-tuning stack:
Foundation (Layer 1)
- HuggingFace PEFT: the canonical open-source implementation of LoRA, adapters, and prompt tuning. If you want to understand how LoRA works in code, this is the reference.
- TRL (SFTTrainer): HuggingFace's training loop library. Handles the SFT objective. Takes PEFT-configured models as input.
Efficiency (Layer 2)
- Unsloth: rewrites attention and matrix kernels in custom CUDA/Triton. The project reports roughly 2× faster training and ~70% less VRAM with the same numerical results. Drop-in compatible with HuggingFace. The standard efficiency layer for most practical LoRA runs.
- BitsAndBytes: the quantisation library that makes QLoRA possible. Handles 4-bit and 8-bit loading of base model weights.
Orchestration (Layer 3)
- Axolotl: config-driven training framework. Declare your base model, dataset, LoRA parameters, and hyperparams in a YAML file; Axolotl assembles the training loop and calls the layers below. The standard choice for reproducible, multi-run experiments.
Managed platforms (Layer 4)
- Together AI, Fireworks AI: managed fine-tuning APIs. Upload a dataset, pick a base model, receive a deployed endpoint. No GPU management. Higher cost per run, lower operational friction.
A typical first LoRA run: Google Colab (free T4 GPU) + Unsloth + HuggingFace PEFT + a Llama or Qwen 3B–8B model. This costs nothing and teaches the full loop end-to-end.
Relationship to SFT
LoRA and SFT are frequently conflated. They are not alternatives — they operate on different axes.
SFT (Supervised Fine-Tuning) is the training objective: given an input, predict the correct output, minimise cross-entropy loss. It defines what you're optimising toward.
LoRA is the parameter efficiency strategy: which weights are allowed to update, and how. It defines where changes happen.
A typical fine-tuning job does both simultaneously: SFT as the objective, LoRA as the efficiency wrapper. When someone says "I LoRA fine-tuned my model," they almost certainly mean SFT + LoRA.
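The SFT objective is the same whether or not LoRA is in use. A minimal sketch of it, assuming toy logits over a four-token vocabulary and hypothetical helper names, is:

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def sft_loss(per_token_logits, target_ids):
    """Mean cross-entropy over the target tokens of one training example."""
    losses = [cross_entropy(l, t) for l, t in zip(per_token_logits, target_ids)]
    return sum(losses) / len(losses)


# Toy example: two output positions, vocabulary of four tokens.
logits = [
    [2.0, 0.1, 0.1, 0.1],  # model favours token 0
    [0.1, 0.1, 3.0, 0.1],  # model favours token 2
]
print("loss when targets match:", round(sft_loss(logits, [0, 2]), 3))
print("loss when targets differ:", round(sft_loss(logits, [1, 3]), 3))
```

LoRA changes nothing in this calculation. It only determines which parameters receive the gradient of the loss: A and B with LoRA, every weight with full fine-tuning.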
Key configuration reference
# Axolotl config excerpt
base_model: meta-llama/Meta-Llama-3-8B-Instruct
adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
load_in_4bit: true # enables QLoRA
bf16: true # bfloat16 training precision; needs an Ampere-or-newer GPU (a T4 would use fp16 instead)
gradient_checkpointing: true
learning_rate: 2e-4
num_epochs: 3
Further reading
- Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models (the original paper)
- Dettmers et al. (2023) — QLoRA: Efficient Finetuning of Quantized LLMs (the QLoRA paper)
- HuggingFace PEFT documentation — practical implementation reference
- Unsloth documentation — optimised LoRA/QLoRA training