
LoRA: Low-Rank Adaptation of Large Language Models

How LoRA adapts a frozen large language model by learning a low-rank update ΔW = (α/r)·BA to each weight matrix, training under 1% of the parameters of full fine-tuning and adding zero inference latency once merged.

TL;DR

  • LoRA freezes the pretrained weights W and learns a low-rank update ΔW = (α/r)·BA, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} with r ≪ d, k.
  • This trains well under 1% of the parameters of full fine-tuning, so adaptation fits on a single GPU and each task’s adapter is a few megabytes.
  • At inference, the scaled product (α/r)·BA is merged back into W, adding zero latency versus the base model.
  • The premise: the weight change needed to adapt a large model has low intrinsic rank.

The low-rank update

Full fine-tuning updates every weight in W ∈ ℝ^{d×k}. LoRA hypothesizes that the useful part of that update is low-rank, and parameterizes it as the scaled product of two small matrices:

W' = W + ΔW = W + (α/r)·BA

Only A and B are trained; W stays frozen. The rank r controls the trade-off between capacity and cost, and α is a constant that scales the update.
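To make the shapes concrete, here is a minimal NumPy sketch of the adapted forward pass. The sizes, the random seed, and the function name `lora_forward` are illustrative assumptions, not taken from the paper or any library. The initialization (small random A, zero B) is a common choice that makes ΔW = 0 before training starts; treat it as an assumption here as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): W is d x k, rank r is much smaller than d and k.
d, k, r, alpha = 64, 48, 4, 8.0

W = rng.standard_normal((d, k))          # frozen pretrained weight, shape (d, k)
A = rng.standard_normal((r, k)) * 0.01   # trainable, shape (r, k)
B = np.zeros((d, r))                     # trainable, shape (d, r); zero init gives ΔW = 0

def lora_forward(x, W, A, B, alpha, r):
    """Compute W'x = Wx + (alpha / r) * B(Ax) without materializing ΔW."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)               # one input vector of length k
y = lora_forward(x, W, A, B, alpha, r)   # output vector of length d
```

Computing B(Ax) rather than (BA)x keeps the intermediate result at length r, so the d × k update never has to be formed during training.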

Parameters, slashed

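Because r is tiny, the trainable count r(d+k) is a small fraction of the full d × k. For example, with d = k = 4096 and r = 8, LoRA trains 65,536 parameters for a matrix that holds 16,777,216, about 0.39%.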
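The count is easy to check for any layer shape. The helper below is a hypothetical name for illustration, and the sizes d = k = 4096, r = 8 are chosen for the example rather than taken from a specific model.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora, ratio) for one d x k weight matrix adapted at rank r."""
    full = d * k          # parameters updated by full fine-tuning
    lora = r * (d + k)    # parameters in B (d x r) plus A (r x k)
    return full, lora, lora / full

full, lora, ratio = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, f"{ratio:.2%}")  # 16777216 65536 0.39%
```

Doubling r doubles the trainable count, so the ratio stays small as long as r remains far below d and k.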

Merge for free inference

Since ΔW = (α/r)·BA has the same shape as W (d × k), it can be added into the weights once after training. The deployed model runs at exactly the base model’s cost: no extra layers, no added latency.
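A small numerical check of the merge, again as a sketch: `merge_lora` is a hypothetical helper, and A and B are random stand-ins for trained adapter weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions); A and B stand in for trained adapter weights.
d, k, r, alpha = 64, 48, 4, 8.0
W = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

def merge_lora(W, A, B, alpha, r):
    """Fold the scaled low-rank update into a single d x k weight matrix."""
    return W + (alpha / r) * (B @ A)

W_merged = merge_lora(W, A, B, alpha, r)

x = rng.standard_normal(k)
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))  # base path plus adapter path
y_merged = W_merged @ x                          # single matmul after merging

print(np.allclose(y_adapter, y_merged))
```

The merged matrix has the same shape as W, so inference needs one matrix multiply per layer, as in the base model. Subtracting the same term recovers W up to floating-point error, which is what lets one base model switch between adapters.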

Why it mattered

LoRA made adapting frontier models accessible. Instead of copying and retraining an entire model per task, you keep one frozen base and swap in small adapters. It became the backbone of the parameter-efficient fine-tuning ecosystem (PEFT, QLoRA), powering everything from instruction tuning to the thousands of community fine-tunes shared as lightweight adapter files.

  • Attention Is All You Need — LoRA most often adapts the attention projection matrices of this architecture
  • BERT — the pretrain-then-fine-tune paradigm LoRA makes dramatically cheaper
  • Switch Transformer — a different efficiency axis: sparse scaling of one big model vs cheap adaptation of a frozen one
  • FlashAttention — efficiency at the kernel level, complementary to LoRA’s efficiency at the fine-tuning level
