Domain Adaptation for Embeddings

An embedding model trained on general web text produces a beautifully organized space — for general web text. Point it at clinical notes, legal filings, or financial reports and the geometry quietly stops working: synonyms drift apart, jargon collapses into a blob, retrieval quality craters. Domain adaptation is the work of re-shaping that space to fit a new distribution, ideally without destroying what the model already knew.

The catch is the "without destroying" part. Push the adaptation too hard and the model forgets its source competence — a failure mode with a name, catastrophic forgetting. Much of the field is the management of that one trade-off.

Interactive Adaptation Simulator

The simulator below is real: each domain is a cluster of genuine GloVe vectors, and dragging the strength slider runs CORAL feature alignment live — whitening the source covariance and re-coloring it with the target's. Watch two things move in opposite directions. The alignment loss falls toward zero (good — the domains now share a shape), while the source structure retained score drops (bad — the source's own neighbors scramble). That divergence is catastrophic forgetting, measured.

The domain shift problem

"Shift" is not one thing. Three distinct distributions can move between source and target, and they call for different fixes:

Covariate shift: the inputs change, the labelling rule does not. P_S(x) ≠ P_T(x) while P(y | x) is stable. (General → medical text: different words, same notion of relevance.)
Label shift: the class balance changes. P_S(y) ≠ P_T(y) while P(x | y) holds.
Concept drift: the rule itself changes over time. P(y | x) is no longer fixed.

For embeddings, covariate shift dominates: the target vocabulary simply lives in a different region of space.

Source Domain	Target Domain	Challenge
General Web Text	Medical Records	Specialized terminology
News Articles	Social Media	Informal language
English Reviews	Spanish Reviews	Language + culture
Synthetic Data	Real Sensors	Noise patterns

Measuring the gap

Before adapting, quantify the distance between domains. Two unsupervised measures dominate — both need only features, no labels.

CORAL (CORrelation ALignment) compares second-order statistics: the Frobenius distance between the two covariance matrices. This is exactly the loss the simulator above minimizes.

\mathcal{L}_\text{CORAL} = \frac{1}{4d^2}\,\lVert C_S - C_T \rVert_F^2

Its closed-form fix is the whitening-and-recoloring map the strength slider applies — at full strength it makes C_S equal to C_T exactly:

\phi(x) = (x - \mu_S)\,C_S^{-1/2}\,C_T^{1/2} + \mu_T

import torch

def coral_loss(source, target):
    """Align second-order statistics, exactly what the simulator does."""
    d = source.size(1)
    cs = torch.cov(source.T)        # source feature covariance, shape (d, d)
    ct = torch.cov(target.T)        # target feature covariance, shape (d, d)
    return (cs - ct).pow(2).sum() / (4 * d * d)   # ||C_S - C_T||_F^2 / (4 d^2)
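The closed-form map can be sketched in the same style. This is a minimal sketch under stated assumptions, not the simulator's own code: the names `coral_transform` and `_sym_matrix_power` and the `eps` ridge term are illustrative, and the strength slider's interpolation is not reproduced because the text does not specify it.

```python
import torch

def _sym_matrix_power(c, power, eps=1e-5):
    """Raise a symmetric PSD matrix to a real power via eigendecomposition."""
    vals, vecs = torch.linalg.eigh(c)
    vals = vals.clamp_min(eps)                    # guard against tiny negatives
    return (vecs * vals.pow(power)) @ vecs.T      # V diag(vals^power) V^T

def coral_transform(source, target, eps=1e-5):
    """Whiten source features, then re-color them with the target covariance."""
    d = source.size(1)
    eye = torch.eye(d, dtype=source.dtype, device=source.device)
    mu_s, mu_t = source.mean(0), target.mean(0)
    cs = torch.cov(source.T) + eps * eye          # small ridge keeps C_S invertible
    ct = torch.cov(target.T) + eps * eye
    whiten = _sym_matrix_power(cs, -0.5)          # C_S^{-1/2}
    recolor = _sym_matrix_power(ct, 0.5)          # C_T^{1/2}
    return (source - mu_s) @ whiten @ recolor + mu_t
```

After the transform, the covariance of the mapped source features matches the target covariance up to the `eps` ridge, so `coral_loss` on the mapped features and the target should be close to zero.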

MMD (Maximum Mean Discrepancy) compares the domains in a kernel feature space φ — sensitive to higher moments CORAL ignores:

\mathrm{MMD}^2(S, T) = \left\lVert \tfrac{1}{n}\sum_i \phi(x_i^S) - \tfrac{1}{m}\sum_j \phi(x_j^T) \right\rVert_{\mathcal{H}}^2
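With the kernel trick, the squared norm expands into three mean kernel values, so φ never has to be computed explicitly. A minimal sketch, assuming an RBF kernel with a hand-picked bandwidth `gamma`; the function names and the default `gamma=1.0` are illustrative choices, not something the text prescribes.

```python
import torch

def rbf_kernel(a, b, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows."""
    return torch.exp(-gamma * torch.cdist(a, b).pow(2))

def mmd2_rbf(source, target, gamma=1.0):
    """Biased (V-statistic) estimate of MMD^2 under an RBF kernel."""
    k_ss = rbf_kernel(source, source, gamma).mean()   # source vs source
    k_tt = rbf_kernel(target, target, gamma).mean()   # target vs target
    k_st = rbf_kernel(source, target, gamma).mean()   # cross term
    return k_ss + k_tt - 2 * k_st
```

The estimate is zero when both samples are identical and grows as the distributions separate. In practice the bandwidth matters a great deal; a common heuristic is to set it from the median pairwise distance.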

Adaptation strategies

Fine-tuning

The blunt instrument: keep training on target data with a smaller learning rate. Effective, but the most prone to forgetting, because every weight is free to move. Gradual unfreezing tames it — thaw layers top-down so the early, general-purpose features survive longest.

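def gradual_unfreeze(model, target_loader, epochs_per_stage=2):
    """Thaw top-down so early layers keep their general features longest."""
    layers = model.encoder.layers
    # Start fully frozen; otherwise every layer already trains in stage one.
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    for n_unfrozen in range(1, len(layers) + 1):
        # Thaw the n_unfrozen layers nearest the output.
        for layer in layers[-n_unfrozen:]:
            for p in layer.parameters():
                p.requires_grad = True
        # train() is the caller's own training loop, not defined here.
        train(model, target_loader, epochs=epochs_per_stage)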

Adapter layers

Freeze the backbone entirely; insert a tiny bottleneck module per layer and train only that. Only the new parameters are trained, typically a few percent of the model's size, so the source weights are structurally protected: they cannot be overwritten, and removing the adapters restores the original model.

h \leftarrow h + W_\text{up}\,\sigma(W_\text{down}\,h)

import torch.nn as nn

class AdapterLayer(nn.Module):
    """Bottleneck adapter: freeze the backbone, train only this."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.ReLU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))   # residual path
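To make "train only the adapter" concrete, here is a hedged sketch that wraps one frozen block and reports how much of the model actually trains. `AdaptedBlock` and `trainable_fraction` are hypothetical names, and the zero initialization of the up-projection is a common convention rather than something the text specifies; it makes the wrapped block behave exactly like the original at step zero.

```python
import torch
import torch.nn as nn

class AdaptedBlock(nn.Module):
    """Wrap a frozen block and add a trainable bottleneck after it."""
    def __init__(self, block, hidden, bottleneck=8):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False           # backbone weights never move
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        nn.init.zeros_(self.up.weight)        # adapter starts as the identity
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        h = self.block(h)
        return h + self.up(torch.relu(self.down(h)))   # residual path

def trainable_fraction(model):
    """Share of parameters that will receive gradient updates."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total
```

Passing only the parameters with `requires_grad=True` to the optimizer then guarantees the backbone stays untouched.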

Elastic Weight Consolidation (EWC)

Let every weight move, but anchor the ones the source task cared about. The Fisher information F_i estimates each parameter's importance to the source task; the penalty charges a quadratic cost for drifting from its source-optimal value \theta_{S,i}^\star. It is the training-time counterpart to the source structure retained meter in the simulator: instead of measuring forgetting after the fact, it prices it into the loss.

\mathcal{L}(\theta) = \mathcal{L}_T(\theta) + \frac{\lambda}{2}\sum_i F_i\,(\theta_i - \theta_{S,i}^\star)^2

def ewc_penalty(model, fisher, theta_star, lam=0.4):
    """Quadratic anchor: the cost of moving weights the source relied on."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss += (fisher[name] * (p - theta_star[name]) ** 2).sum()
    return 0.5 * lam * loss
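The `fisher` dictionary has to come from somewhere. One common approximation is the empirical diagonal Fisher: the mean squared gradient of the per-example loss on source data. The sketch below assumes a classification-style source task scored with cross-entropy and a list of `(x, y)` tensor pairs; both are assumptions for illustration, and an embedding model would substitute its own source loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diagonal_fisher(model, examples):
    """Empirical diagonal Fisher: mean squared gradient of the per-example loss."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in examples:
        model.zero_grad()
        # One example at a time, so gradients are squared before averaging.
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(examples) for n, f in fisher.items()}
```

The matching `theta_star` is a snapshot taken at the same moment, for example `{n: p.detach().clone() for n, p in model.named_parameters()}`, before any target training begins.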

Domain-adversarial training (DANN)

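Train the feature extractor so that a domain classifier cannot tell source from target. A gradient-reversal layer sits between the two: it is the identity on the forward pass and flips the sign of the classifier's gradient on the backward pass, so the extractor is pushed toward domain-invariant features while the classifier keeps trying to separate them. No target labels are required, which makes DANN a common choice for unsupervised adaptation.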
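The gradient-reversal layer is small enough to write out. This is a minimal sketch using a custom autograd function; the names `GradReverse` and `grad_reverse` and the scaling factor `lam` are illustrative rather than taken from the text.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign so the feature extractor ascends the domain loss.
        # The second return value is the (non-existent) gradient for lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```

In a training step the domain classifier would see `grad_reverse(features, lam)` instead of `features`, so a single backward pass trains the classifier to separate the domains while pushing the extractor to confuse it.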

Choosing a strategy

The honest comparison is not "which scores highest" — it is what each method spends to buy target performance.

Aspect	Full fine-tuning	Adapter / EWC
Parameters updated	All of them	A few percent (adapter); all, but penalized (EWC)
Forgetting risk	High: every weight is free to drift	Low: source knowledge protected by construction (adapter) or by penalty (EWC)
Target data needed	Medium to high	Low
Best when	Large target set, source no longer needed	Small target set, must stay strong on source

Catastrophic forgetting

This is the pitfall the simulator is built around. Pure feature alignment achieves a perfect target fit — and pays for it by scrambling the source's own neighbor structure. In a deployed system that shows up as a model that suddenly retrieves brilliantly on the new domain and has quietly gotten worse everywhere else.
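One way to put a number on "scrambled neighbor structure" is to compare each source item's nearest neighbors before and after adaptation. The text does not say how the simulator computes its source structure retained score, so the metric below is an assumption: mean k-nearest-neighbor overlap under cosine similarity, with hypothetical function names.

```python
import torch

def knn_indices(x, k):
    """Indices of each row's k nearest neighbours by cosine similarity, self excluded."""
    x = torch.nn.functional.normalize(x, dim=1)
    sim = x @ x.T
    sim.fill_diagonal_(float("-inf"))       # a point is not its own neighbour
    return sim.topk(k, dim=1).indices

def neighbour_retention(before, after, k=5):
    """Fraction of each item's k nearest neighbours that survive adaptation."""
    nb, na = knn_indices(before, k), knn_indices(after, k)
    overlap = [len(set(b.tolist()) & set(a.tolist())) for b, a in zip(nb, na)]
    return sum(overlap) / (k * len(overlap))
```

A value of 1.0 means every source item kept all of its neighbors; tracking it next to the target metric makes the trade-off visible during training rather than after deployment.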

The defenses all aim at the same target-vs-source trade-off from different angles:

Adapter layers — make forgetting impossible by freezing the backbone.
EWC — make forgetting expensive via the Fisher-weighted penalty above.
Rehearsal — mix source and target data so the source loss never disappears.
Lower learning rates — move slowly enough that source structure survives.
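Rehearsal is the easiest of these to sketch. The generator below builds batches that are mostly target examples with a fixed slice of replayed source examples. It is a minimal sketch: `mixed_batches` is a hypothetical helper, and reading the mix ratio as "fraction of each batch drawn from source" is an assumption.

```python
import random

def mixed_batches(source, target, batch_size=16, source_frac=0.2, seed=0):
    """Yield batches of target examples plus a fixed share of source replay."""
    rng = random.Random(seed)
    n_source = max(1, round(batch_size * source_frac))   # at least one replay item
    n_target = batch_size - n_source
    order = list(target)
    rng.shuffle(order)
    # Walk the shuffled target set once; a trailing partial batch is dropped.
    for start in range(0, len(order) - n_target + 1, n_target):
        batch = order[start:start + n_target] + rng.sample(source, n_source)
        rng.shuffle(batch)
        yield batch
```

Because source items are resampled for every batch, a small source buffer is enough to keep the source loss present throughout training.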

Two related traps are worth naming. Negative transfer — an unrelated source actively hurts the target — argues for careful source selection and DANN-style invariance. Overfitting to a tiny target set argues for strong regularization, augmentation, and early stopping.

Best practices

Start from a general pre-trained model, prefer a parameter-efficient method when the target set is small, and always monitor both domains — a target gain that tanks source performance is rarely the win it looks like.

Parameter	Recommended Range	Notes
Learning Rate	1e-5 to 1e-4	Lower than source training
Batch Size	8–32	Smaller for limited target data
Epochs	3–10	Avoid overfitting
Warmup Steps	10% of total	Stabilize training
Mix Ratio	0.1–0.3	Source:target rehearsal ratio

Conclusion

Domain adaptation is essential whenever training and deployment distributions differ — which is nearly always. The technique matters less than the discipline: measure the gap (CORAL, MMD), pick a method whose cost you can afford, and watch the source and target metrics together. The simulator makes the core tension concrete — every step toward the target is a step away from the source, and the craft is buying the first without paying the second.
