LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu; Yelong Shen; Phillip Wallis; Zeyuan Allen-Zhu; Yuanzhi Li; Shean Wang; Lu Wang; Weizhu Chen

TL;DR

LoRA freezes the pretrained weights W ∈ ℝ^{d× k} and learns a low-rank update \Delta W = \frac{\alpha}{r} B A, where B ∈ ℝ^{d× r} and A ∈ ℝ^{r× k} with r ≪ d, k, and α is a constant scaling factor.
This trains well under 1% of the parameters that full fine-tuning updates, which cuts optimizer memory enough that adaptation can often fit on a single GPU, and each task’s adapter is often only a few megabytes.
At inference, the update \Delta W is merged back into W, so the adapted model adds zero latency versus the base model.
The premise: the weight change needed to adapt a large model has low intrinsic rank.

The low-rank update

Full fine-tuning updates every weight in W. LoRA hypothesizes that the useful part of that update is low-rank and parameterizes it as the product of two small matrices:

W' = W + \Delta W = W + \frac{\alpha}{r} B A

Only A and B are trained; W stays frozen. The rank r controls the trade-off between capacity and cost.
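To make the parameterization concrete, here is a minimal NumPy sketch of a single adapted layer. The sizes, the value of alpha, the function name lora_forward, and the initialization (small random A, zero B, so the update starts at zero) are illustrative assumptions, not details stated in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 32, 4   # illustrative sizes, with r much smaller than d and k
alpha = 8.0           # scaling hyperparameter (arbitrary value for this sketch)

W = rng.normal(size=(d, k))              # frozen pretrained weight, never updated
A = rng.normal(scale=0.01, size=(r, k))  # trainable; small random init (assumed)
B = np.zeros((d, r))                     # trainable; zero init so delta_W starts at 0


def lora_forward(x, W, A, B, alpha, r):
    """Compute (W + (alpha / r) * B @ A) @ x without forming the d x k update."""
    base = W @ x          # frozen path
    update = B @ (A @ x)  # low-rank path: project down to r dims, then back up to d
    return base + (alpha / r) * update


x = rng.normal(size=(k,))
y = lora_forward(x, W, A, B, alpha, r)
```

Because B starts at zero in this sketch, y equals W @ x before any training step, so the adapted layer begins as an exact copy of the base layer.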

Parameters, slashed

Because r is tiny, the trainable count r(d+k) is a rounding error next to the full d × k.
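As a worked example of that ratio, the short sketch below compares the two counts for one weight matrix. The helper name lora_param_counts and the sizes d = k = 4096, r = 8 are chosen here for illustration only.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora, fraction) trainable-parameter counts for one d x k matrix."""
    full = d * k        # every entry of W is trainable in full fine-tuning
    lora = r * (d + k)  # entries of B (d x r) plus entries of A (r x k)
    return full, lora, lora / full


full, lora, fraction = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  LoRA: {lora:,}  fraction: {fraction:.4%}")
# prints: full: 16,777,216  LoRA: 65,536  fraction: 0.3906%
```

For a square matrix the fraction simplifies to 2r/d, so it shrinks further as the layer gets wider.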

Merge for free inference

Since \Delta W = \frac{\alpha}{r} B A has the same shape as W, it can be added into the weights once after training. The deployed model then runs at exactly the base model’s cost: no extra layers, no added latency.
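The merge can be checked numerically. The sketch below uses random stand-ins for a trained A and B (all sizes and values are assumptions for illustration) and confirms that a single merged matrix gives the same output as the separate base and adapter paths.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8.0  # illustrative sizes and scaling

W = rng.normal(size=(d, k))  # frozen base weight
A = rng.normal(size=(r, k))  # stand-in for a trained A
B = rng.normal(size=(d, r))  # stand-in for a trained B

# Merge once after training: the result has the same shape as W.
W_merged = W + (alpha / r) * (B @ A)

x = rng.normal(size=(k,))
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))  # base path plus adapter path
y_merged = W_merged @ x                          # one matmul, same cost as the base

# Subtracting the same update recovers the base weight (up to float rounding).
W_restored = W_merged - (alpha / r) * (B @ A)
```

The last line illustrates why one frozen base can serve many tasks: an adapter can be subtracted out and a different one added in without touching the stored base weights.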

Why it mattered

LoRA made adapting frontier models accessible. Instead of copying and retraining an entire model per task, you keep one frozen base and swap in small adapters. It became the backbone of the parameter-efficient fine-tuning ecosystem (PEFT, QLoRA), powering everything from instruction tuning to the thousands of community fine-tunes shared as lightweight adapter files.

Related Reading

Attention Is All You Need: LoRA most often adapts the attention projection matrices of this architecture
BERT: the pretrain-then-fine-tune paradigm LoRA makes dramatically cheaper
Switch Transformer: a different efficiency axis, sparse scaling of one big model versus cheap adaptation of a frozen one
FlashAttention: efficiency at the kernel level, complementary to LoRA’s efficiency at the fine-tuning level
