SigLIP: Sigmoid Loss for Language Image Pre-Training

Xiaohua Zhai; Basil Mustafa; Alexander Kolesnikov; Lucas Beyer

TL;DR

SigLIP replaces CLIP’s softmax (InfoNCE) loss with a pairwise sigmoid loss that treats every image–text pair as an independent binary classification, removing the need for a global normalization step.
Because there is no row-wise softmax, there is no need for an all-gather across devices — the loss is computed locally on each accelerator, unlocking far simpler and more scalable training.
SigLIP matches or surpasses CLIP at much smaller batch sizes (as low as 256 vs. CLIP’s typical 32K+), making strong vision–language pre-training accessible without massive infrastructure.
A learnable bias and temperature address the natural 1-positive vs. (N−1)-negatives imbalance, preventing the model from collapsing at initialization.

The cost of softmax

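CLIP’s contrastive loss is the InfoNCE objective. For a batch of $N$ image–text pairs, the image-side loss is:

$$
\mathcal{L}_{\text{softmax}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{t \cdot \mathrm{sim}(I_i, T_i)}}{\sum_{j=1}^{N} e^{t \cdot \mathrm{sim}(I_i, T_j)}}
$$

where $I_i$ and $T_j$ are L2-normalised image and text embeddings, $\mathrm{sim}(I_i, T_j)$ is their dot product (cosine similarity), and $t$ is a learned temperature. A symmetric text-side loss is averaged in.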

The denominator is the critical problem: computing it for a single image requires every text embedding in the batch. When training is distributed across many accelerators, this means performing an expensive all-gather communication collective to assemble the full batch on every device before the loss can be computed. Larger batches improve accuracy but compound this cost, because the $N \times N$ similarity matrix grows quadratically with the batch size. The result is a tight coupling between batch size, number of devices, and per-step training time.
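The row-wise dependence described above can be checked numerically. The following is a minimal single-process NumPy sketch, not CLIP's actual implementation; the helper names (`normalise`, `softmax_row_losses`) and the fixed logit scale `t=10.0` are assumptions made for illustration.

```python
import numpy as np

def normalise(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax_row_losses(img, txt, t=10.0):
    """Per-image InfoNCE terms for embeddings of shape (N, D).

    t is an assumed fixed logit scale; in training it would be learned.
    """
    logits = t * (img @ txt.T)                     # (N, N) scaled similarities
    # Stable row-wise log-sum-exp: row i touches all N text embeddings.
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_numerator = np.diag(logits)                # matching pairs lie on the diagonal
    return log_denominator - log_numerator         # -log softmax of the positive

rng = np.random.default_rng(0)
img = normalise(rng.normal(size=(4, 8)))
txt = normalise(rng.normal(size=(4, 8)))

before = softmax_row_losses(img, txt)
loss = before.mean()                               # image-side loss for the batch

# Replace only the last text embedding: the term for image 0 still changes,
# because text 3 appears in the denominator of every row.
txt_changed = txt.copy()
txt_changed[3] = normalise(rng.normal(size=(1, 8)))[0]
after = softmax_row_losses(img, txt_changed)
row0_shift = abs(after[0] - before[0])
```

Here `row0_shift` is non-zero even though neither image 0 nor its matching text was touched, which is the coupling that forces every device to see the whole batch.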

Sigmoid loss

SigLIP proposes a replacement that breaks this coupling. Instead of an $N$-way classification, it casts each of the $N^2$ possible image–text pairs as an independent binary classification: does this pair match (+1) or not (−1)?

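$$
\mathcal{L}_{\text{sigmoid}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\!\left( z_{ij} \left( t \cdot \mathrm{sim}(I_i, T_j) + b \right) \right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}
$$

where $z_{ij} = +1$ if $i = j$ (matching pair) and $z_{ij} = -1$ otherwise, $t$ is a learned temperature (equivalently, a logit scale), and $b$ is a learned bias. Each term depends only on a single pair of embeddings. There is no normalization across rows or columns, so no all-gather is required.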
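As a rough illustration of the pairwise loss just described, here is a minimal NumPy sketch. It assumes L2-normalised inputs and fixed values `t=10.0` and `b=-10.0` in place of learned parameters; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def sigmoid_pair_losses(img, txt, t=10.0, b=-10.0):
    """(N, N) matrix of independent binary losses, one per image-text pair.

    t (logit scale) and b (bias) are assumed constants here; in training
    both would be learned.
    """
    n = img.shape[0]
    logits = t * (img @ txt.T) + b
    z = 2.0 * np.eye(n) - 1.0          # +1 on the diagonal (matches), -1 elsewhere
    return -log_sigmoid(z * logits)

def sigmoid_loss(img, txt, t=10.0, b=-10.0):
    """Sum of all pair losses, divided by the number of images N."""
    return sigmoid_pair_losses(img, txt, t, b).sum() / img.shape[0]

# Toy batch: four orthonormal embeddings used for both modalities.
emb = np.eye(4)
pair_losses = sigmoid_pair_losses(emb, emb)
total = sigmoid_loss(emb, emb)
```

Each entry of `pair_losses` is computed from one image row and one text row only, so changing any other embedding leaves it untouched.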

Viewed as an $N \times N$ grid of image–text pairs, the structural difference becomes concrete. In softmax mode, evaluating any one cell requires the full row (all text embeddings) to be present before the denominator can be computed. In sigmoid mode, each cell’s value is determined solely by the embeddings of image $i$ and text $j$, with no dependence on any other pair. This is what makes the loss embarrassingly parallel.
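One way to see the parallelism is to accumulate the sigmoid loss block by block and compare it with the full-matrix result. The sketch below runs in a single process and only mimics how work could be split; the chunk size, the helper names, and the constants `t=10.0`, `b=-10.0` are illustrative assumptions, and it does not reproduce the paper's distributed implementation.

```python
import numpy as np

def normalise(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pair_losses(img, txt, z, t=10.0, b=-10.0):
    """-log sigmoid(z * (t * sim + b)) for every pair in the given block."""
    logits = t * (img @ txt.T) + b
    return np.logaddexp(0.0, -z * logits)

rng = np.random.default_rng(0)
n, d, chunk = 8, 16, 2                     # assumed toy sizes
img = normalise(rng.normal(size=(n, d)))
txt = normalise(rng.normal(size=(n, d)))
z = 2.0 * np.eye(n) - 1.0                  # +1 for matching pairs, -1 otherwise

# Reference: the whole N x N matrix at once.
full_loss = pair_losses(img, txt, z).sum() / n

# Same loss, one (image chunk, text chunk) block at a time. Each block
# needs only its own slice of embeddings and labels.
chunked_loss = 0.0
for i in range(0, n, chunk):
    for j in range(0, n, chunk):
        block = pair_losses(img[i:i + chunk], txt[j:j + chunk],
                            z[i:i + chunk, j:j + chunk])
        chunked_loss += block.sum() / n
```

The two totals agree up to floating-point rounding, which would not be possible for the softmax loss without first assembling each full row.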

Batch size, freed

Because each pair contributes independently to the gradient, SigLIP does not need large batches to form a meaningful signal. CLIP required batches of 32K or more to have enough negatives to make the softmax informative; SigLIP produces usable gradients from batches of 256 because every pair — positive or negative — contributes directly and independently.

In practice, the paper shows that SigLIP with a batch size of 16K matches the zero-shot performance of CLIP at 32K, and continues to perform competitively even at 1K. Smaller batches reduce memory requirements and remove the all-gather bottleneck, making the training loop both cheaper and simpler to implement across heterogeneous hardware.

Taming the imbalance

Removing the softmax normalization introduces a new challenge. At initialization, the model assigns roughly equal similarities to all pairs, but in a batch of $N$ pairs there is exactly 1 positive per row against $N-1$ negatives. With random initialisation and no bias, every pair scores near 0.5 under the sigmoid, so the gradient is dominated by the many negatives and the positives contribute comparatively little. Left unchecked, the model can stagnate or diverge early in training.

SigLIP addresses this with the learnable bias $b$. Initialised to a large negative value (around −10), it shifts the sigmoid so that all pairs start with probabilities close to zero. This mirrors the prior probability of a random pair being positive, which is $1/N$. Negative pairs therefore start out with predictions near 0, and positive pairs receive a strong gradient signal right from the start. The temperature (logit scale) $t$ controls the sharpness of the sigmoid and is learned jointly.
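The rebalancing can be estimated with a back-of-the-envelope calculation. The sketch below makes a strong simplifying assumption: at initialization every pair has the same similarity (0 by default), so all logits equal `t * sim + b`. It then totals the gradient magnitude with respect to the logits for positives and negatives, ignoring the shared 1/N factor. The function name and `t=10.0` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_mass(n, b, t=10.0, sim=0.0):
    """Total |dLoss/dlogit| from positives and from negatives.

    Assumes every pair has the same similarity `sim` (an idealised
    picture of initialization). t is an assumed logit scale.
    """
    p = sigmoid(t * sim + b)             # predicted match probability
    positive_mass = n * (1.0 - p)        # each positive contributes 1 - p
    negative_mass = n * (n - 1) * p      # each negative contributes p
    return positive_mass, negative_mass

n = 1024
pos_zero_bias, neg_zero_bias = gradient_mass(n, b=0.0)
pos_neg_bias, neg_neg_bias = gradient_mass(n, b=-10.0)

print("b = 0:   negatives / positives =", neg_zero_bias / pos_zero_bias)
print("b = -10: negatives / positives =", neg_neg_bias / pos_neg_bias)
```

Under these assumptions, a zero bias lets negatives outweigh positives by a factor of N − 1, while a bias of −10 reverses the balance so that positives dominate.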

The effect of the bias is visible in the sigmoid curve itself. At $b = 0$, the decision boundary sits at a cosine similarity of 0 and the curve is broadly centred, so negatives receive large gradients. With a strongly negative bias, the curve shifts right so that only pairs with high similarity (plausibly positives) score above 0.5, rebalancing the gradient contributions.
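The position of the decision boundary follows directly from the sigmoid's argument. As a short worked derivation, writing $s$ for the cosine similarity of a pair (a symbol introduced here for convenience) and taking $t = 10$ as an assumed example value:

```latex
\sigma(t \cdot s + b) = \tfrac{1}{2}
\;\Longleftrightarrow\; t \cdot s + b = 0
\;\Longleftrightarrow\; s^{*} = -\frac{b}{t}
% b = 0:            s* = 0
% b = -10, t = 10:  s* = 1   (t = 10 is an assumed value for illustration)
```

Under that assumption, a bias of −10 puts the boundary at the maximum possible cosine similarity, so essentially every pair starts below 0.5, in line with all pairs starting with probabilities close to zero.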

Why it mattered

SigLIP arrived at a moment when the community had assumed large batch sizes were a fundamental requirement for contrastive vision–language training. It demonstrated that the requirement was an artifact of the softmax normalisation, not an intrinsic property of the task. The resulting model trains faster per step, scales more gracefully to large numbers of accelerators (no all-gather), and reaches comparable quality at smaller batch sizes.

SigLIP has become a default replacement for CLIP encoders in downstream vision–language models. PaLI-X, Gemini’s vision encoder, and several open-source VLMs (LLaVA, InternVL) now train or initialise from SigLIP checkpoints. The simplicity of the sigmoid formulation also makes it straightforward to extend: the loss can be applied asymmetrically, combined with masked prediction heads, or adapted to multi-label settings without the structural constraints imposed by softmax normalisation.

CLIP — the softmax contrastive predecessor that SigLIP replaces, establishing the dual-encoder vision–language framework
CoCa — combines contrastive and captioning losses in a single model, complementing SigLIP’s efficiency-focused approach
BLIP-2 — bridges frozen vision encoders (often SigLIP-based) with large language models via a lightweight Q-Former