
Segment Anything Model (SAM)

SAM is a promptable segmentation model that can segment any object in an image using points, boxes, or text prompts with zero-shot generalization.

Alexander Kirillov, Eric Mintun, et al. · 15 min read · Original Paper · Computer Vision · Image Segmentation · Deep Learning

TL;DR

SAM frames image segmentation as a promptable task: given an image and a prompt (point, box, mask, or text), produce a valid segmentation mask. A ViT-H image encoder computes the image embedding once, then a lightweight mask decoder produces masks in real time (~50ms) for any prompt. Trained on SA-1B — 1.1 billion masks across 11 million images, built via a three-stage data engine — SAM achieves strong zero-shot transfer to unseen tasks and distributions without fine-tuning. The contribution is not a single architectural novelty but the combination of task definition, data engine, and scale that makes segmentation a foundation-model problem.

The Foundation Model Framing

SAM applies the NLP foundation model playbook to segmentation. In NLP, GPT and BERT defined broad pretraining tasks (next-token prediction, masked language modeling) that transfer to diverse downstream tasks via prompting. SAM does the same for segmentation: define a single task general enough to serve as pretraining, train at massive scale, then transfer via prompt engineering at inference.

The key question is: what is the right "pre-trainable" task for segmentation? The authors propose promptable segmentation — given any segmentation prompt, return a valid mask. This is deliberately underspecified: the prompt may be ambiguous (a point on an object could mean the part or the whole), so the model must handle ambiguity gracefully rather than forcing a single interpretation.

This framing has a subtle but important consequence for the training objective. Traditional segmentation models optimize for a fixed label set on a fixed dataset. SAM instead optimizes for prompt-conditional mask prediction, which means the model must learn a general correspondence between spatial prompts and object boundaries rather than memorizing category-specific segmentation patterns.

The Promptable Segmentation Task

Formally, the task is a function f : (I, p) → M that maps an image I and a prompt p to a set of valid segmentation masks M. "Valid" means any mask that a reasonable annotator would produce given the same prompt — the model is not required to guess the user's intent when the prompt is ambiguous.

This task subsumes several existing segmentation tasks. Interactive segmentation (clicks to masks), edge detection (dense points to boundaries), object proposal generation (grid of prompts to candidate masks), and instance segmentation (box prompts to masks) are all special cases of promptable segmentation with different prompt types. This generality is what makes it suitable as a pretraining task — a model that solves promptable segmentation well has implicitly learned the sub-skills needed for all these downstream tasks.

Architecture: Three Components

SAM decomposes into three modules with an asymmetric compute design: a heavy image encoder that runs once per image, and lightweight prompt encoder + mask decoder that run per prompt.

Image Encoder (ViT-H). A Vision Transformer pretrained with MAE (Masked Autoencoder), specifically ViT-Huge (632M parameters, 32 transformer blocks, embedding dimension 1280). The architecture uses 14×14 windowed attention in most blocks with four interleaved global attention blocks to capture long-range dependencies. The input image is resized to 1024×1024 and the encoder produces a 64 × 64 × 256 feature map (after a neck that reduces the channel dimension from 1280 to 256). This is the expensive step (~0.15s on an A100), but it only runs once per image. All subsequent prompt interactions reuse this embedding.
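
The shape bookkeeping above can be checked in a few lines, assuming the standard ViT patch size of 16 (so 1024/16 = 64 tokens per side):

```python
# Shape walk-through of SAM's image encoder as described above.
# Assumes a 16x16 patch size (1024 / 16 = 64 tokens per side).
def encoder_shapes(image_size=1024, patch=16, vit_dim=1280, neck_dim=256):
    tokens_per_side = image_size // patch                     # 64
    n_tokens = tokens_per_side ** 2                           # 4096 patch tokens
    vit_out = (tokens_per_side, tokens_per_side, vit_dim)     # 64 x 64 x 1280
    embedding = (tokens_per_side, tokens_per_side, neck_dim)  # 64 x 64 x 256 after the neck
    return n_tokens, vit_out, embedding

print(encoder_shapes())  # (4096, (64, 64, 1280), (64, 64, 256))
```

This 64 × 64 × 256 tensor is the embedding that every subsequent prompt reuses.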

Prompt Encoder. Handles two categories of prompts:

  • Sparse prompts (points, boxes, text): mapped to 256-d embedding vectors via learned positional encodings. Points use two learned embeddings (foreground/background) summed with positional encodings. Boxes are encoded as two points (top-left, bottom-right). Text prompts use the CLIP text encoder.
  • Dense prompts (masks): downscaled to 256 × 256 and mapped to a 256-d spatial embedding via two 2 × 2 convolutions with stride 2, producing a 64 × 64 × 256 feature map that is element-wise summed with the image embedding.
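
As a concrete sketch, the sparse-prompt encoding can be approximated with random Fourier positional features summed with a per-type embedding. The projection matrix `B` and the `type_embed` table below are random stand-ins, not SAM's trained weights:

```python
import numpy as np

# Sketch of sparse-prompt encoding: random Fourier positional features
# plus a per-type embedding. B and type_embed are random stand-ins for
# SAM's learned parameters.
rng = np.random.default_rng(0)
D = 256
B = rng.normal(size=(2, D // 2))  # random projection for 2-D coordinates

type_embed = {name: rng.normal(size=D)
              for name in ("fg", "bg", "tl", "br")}  # point / box-corner types

def encode_point(xy, kind, image_size=1024):
    coords = np.asarray(xy) / image_size                 # normalize to [0, 1]
    proj = 2 * np.pi * (coords @ B)                      # (128,)
    pos = np.concatenate([np.sin(proj), np.cos(proj)])   # (256,)
    return pos + type_embed[kind]

def encode_box(box):
    # A box is encoded as its two corners: top-left and bottom-right.
    x0, y0, x1, y1 = box
    return np.stack([encode_point((x0, y0), "tl"),
                     encode_point((x1, y1), "br")])

print(encode_point((512, 512), "fg").shape)    # (256,)
print(encode_box((100, 100, 400, 300)).shape)  # (2, 256)
```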

Mask Decoder. A modified transformer decoder with only two blocks — deliberately kept small so that the prompt-to-mask step is fast. Each block performs four operations: (1) self-attention on prompt tokens, (2) cross-attention from prompt tokens to the image embedding, (3) point-wise MLP on each token, (4) cross-attention from the image embedding to prompt tokens. The bidirectional cross-attention (steps 2 and 4) is the key design choice: it lets the image features attend to prompt information and vice versa, unlike a standard decoder that only attends in one direction.

After the two blocks, the image embedding is upsampled 4× via two transposed convolution layers (each 2 × 2 with stride 2), and each output token produces a mask prediction via a spatial dot product with the upsampled features, followed by a per-pixel sigmoid. The entire decoder has only ~4M parameters and runs in ~50ms on a GPU, enabling real-time interactive segmentation.
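
A minimal numpy sketch of one two-way block and the dot-product mask readout, to make the data flow concrete. It uses single-head attention with no learned projections or layer norms, and omits the 4× upsampling, so the mask is read out at 64×64:

```python
import numpy as np

def attend(q, k, v):
    # Single-head scaled dot-product attention (no learned projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def two_way_block(tokens, image_emb):
    tokens = tokens + attend(tokens, tokens, tokens)           # (1) self-attention
    tokens = tokens + attend(tokens, image_emb, image_emb)     # (2) tokens -> image
    tokens = tokens + np.tanh(tokens)                          # (3) MLP stand-in
    image_emb = image_emb + attend(image_emb, tokens, tokens)  # (4) image -> tokens
    return tokens, image_emb

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 256))           # prompt + output tokens
image_emb = rng.normal(size=(64 * 64, 256))  # flattened 64x64x256 embedding

for _ in range(2):                           # two decoder blocks
    tokens, image_emb = two_way_block(tokens, image_emb)

# One output token -> one mask: spatial dot product, then threshold
# (sigmoid(logit) > 0.5 is equivalent to logit > 0).
logits = image_emb @ tokens[0]               # (4096,)
mask = logits > 0
print(mask.reshape(64, 64).shape)            # (64, 64)
```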

Ambiguity-Aware Output

A single point prompt is inherently ambiguous — it could refer to a subpart, the whole object, or the entire scene. SAM handles this by predicting three masks simultaneously (whole, part, subpart), each with an associated IoU confidence score. For a deeper exploration of how this multi-mask strategy resolves ambiguity in practice, see the SAM multi-mask ambiguity article. During training, only the mask with the lowest loss against the ground truth receives gradients:

$$\mathcal{L} = \min_{i \in \{1,2,3\}} \left[ \lambda_{\text{focal}} \cdot \mathcal{L}_{\text{focal}}(\hat{m}_i, m^*) + \lambda_{\text{dice}} \cdot \mathcal{L}_{\text{dice}}(\hat{m}_i, m^*) + \mathcal{L}_{\text{IoU}}(\hat{s}_i, \mathrm{IoU}(\hat{m}_i, m^*)) \right]$$

where $\hat{m}_i$ is the i-th predicted mask, $m^*$ is the ground truth, $\hat{s}_i$ is the predicted IoU score, and the loss combines focal loss, dice loss, and a mean-squared-error IoU prediction loss. The min-over-masks strategy avoids averaging across ambiguous interpretations, letting each mask head specialize in a different granularity. At inference, the mask with the highest predicted IoU is selected by default, though all three can be returned for applications that want multi-granularity output.
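
The min-over-masks objective can be sketched as follows. The 20:1 focal-to-dice weighting follows the paper; the focal and dice implementations here are simplified stand-ins:

```python
import numpy as np

def focal_loss(pred, gt, gamma=2.0, eps=1e-6):
    # Simplified per-pixel focal loss on probabilities (alpha term omitted).
    p = np.clip(pred, eps, 1 - eps)
    pt = np.where(gt == 1, p, 1 - p)
    return float((-(1 - pt) ** gamma * np.log(pt)).mean())

def dice_loss(pred, gt, eps=1e-6):
    inter = (pred * gt).sum()
    return float(1 - (2 * inter + eps) / (pred.sum() + gt.sum() + eps))

def iou(pred, gt):
    inter = np.logical_and(pred > 0.5, gt > 0.5).sum()
    union = np.logical_or(pred > 0.5, gt > 0.5).sum()
    return inter / max(union, 1)

def sam_loss(preds, iou_scores, gt, lam_focal=20.0, lam_dice=1.0):
    # Min over the predicted masks: only the best interpretation gets gradients.
    per_mask = [lam_focal * focal_loss(p, gt) + lam_dice * dice_loss(p, gt)
                + (s - iou(p, gt)) ** 2  # MSE loss on the IoU head
                for p, s in zip(preds, iou_scores)]
    return min(per_mask)

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1.0
print(sam_loss([gt], [1.0], gt) < sam_loss([1 - gt], [1.0], gt))  # True
```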

Training uses mixed prompt simulation: during each iteration, prompts are randomly sampled as points, boxes, or masks (with the first interaction simulated from ground truth with added noise). The model is trained for 11 iterations of prompt refinement per sample, where the output mask from the previous iteration becomes the mask prompt for the next, teaching SAM to iteratively refine predictions.
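
The refinement loop can be sketched with a stand-in model (a dummy callable here, not SAM's decoder) to show how each round's output becomes the next round's mask prompt:

```python
# Sketch of the 11-round prompt-refinement loop described above. `model`
# is a stand-in callable; it records that each round receives the previous
# round's output as its dense mask prompt.
def refine(model, image_emb, first_points, n_iters=11):
    prompt = {"points": first_points, "mask": None}
    pred = None
    for _ in range(n_iters):
        pred = model(image_emb, prompt)
        prompt["mask"] = pred  # output becomes the next round's mask prompt
    return pred

history = []
def dummy_model(image_emb, prompt):
    history.append(prompt["mask"])
    return len(history)  # stand-in "mask": just the round number

out = refine(dummy_model, image_emb=None, first_points=[(512, 512)])
print(out, history[0], history[-1])  # 11 None 10
```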

The Data Engine: Three Stages

The SA-1B dataset was not hand-curated from scratch. Instead, the authors built a data engine — a model-in-the-loop annotation pipeline that progressively reduces human effort across three stages:

  1. Assisted-Manual (Stage 1). Annotators label masks using a browser-based tool powered by an early SAM model. SAM proposes masks, annotators correct them. 120k images, 4.3M masks. Average annotation time fell from 34 to 14 seconds per mask as the model was retrained on the growing dataset.

  2. Semi-Automatic (Stage 2). SAM generates confident masks automatically; annotators label only the remaining unannotated objects. This increases object diversity by focusing human effort on objects SAM missed. 180k images, 5.9M additional masks. Annotation time rose back to 34 seconds per mask, since the remaining objects were harder to label.

  3. Fully Automatic (Stage 3). A 32×32 grid of point prompts (1,024 points per image) is applied. For each point, the model predicts three masks at different granularities. Masks are filtered by predicted IoU confidence (threshold 0.88), deduplicated via NMS with an IoU threshold, and stability-filtered by checking whether the mask changes under small perturbations of the logit threshold. No human annotation. 11M images, 1.1B masks. This stage produces the bulk of SA-1B.
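
The stability filter can be sketched directly: binarize the mask logits at a threshold shifted up and down by a delta and compare the two results. (The 0.88 IoU-confidence filter and NMS are separate steps, not shown here; the delta value is illustrative.)

```python
import numpy as np

def stability_score(logits, thresh=0.0, delta=1.0):
    # IoU between the mask binarized at a high and a low logit threshold.
    # Stable masks barely change; unstable ones shrink or grow a lot.
    hi = logits > (thresh + delta)
    lo = logits > (thresh - delta)
    inter = np.logical_and(hi, lo).sum()
    union = np.logical_or(hi, lo).sum()
    return inter / max(union, 1)

rng = np.random.default_rng(0)
confident = np.where(rng.random((64, 64)) < 0.2, 8.0, -8.0)  # saturated logits
fuzzy = rng.normal(scale=1.0, size=(64, 64))                  # logits near zero

print(stability_score(confident))        # 1.0 -- survives filtering
print(stability_score(fuzzy) < 0.9)      # True -- filtered out
```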

The final SA-1B dataset contains 1.1 billion masks on 11 million licensed images — 400× more masks than the next largest segmentation dataset (Open Images V5, 2.8M masks). The median image contains ~100 masks, with substantially better coverage of small and medium objects compared to prior datasets. Human quality ratings showed 94% of automatically generated masks had IoU > 0.90 compared to professional annotations.

Key Results

Zero-shot single-point segmentation. SAM achieves a mean IoU of 55.8 on a 23-dataset benchmark using a single foreground point, outperforming RITM (a strong interactive segmentation baseline trained on each dataset) on 16 of 23 datasets. When evaluated with oracle multi-point selection, performance reaches 73.0 mIoU.

Zero-shot edge detection. On BSDS500, SAM produces edge maps (from predicted mask boundaries) with an F1 score of 0.768 without any edge-specific training — competitive with dedicated edge detectors.

Zero-shot object proposals. On LVIS v1, SAM generates object proposals with an average recall (AR@1000) of 75.7 at all scales, outperforming ViTDet-H (the supervised baseline) on medium and large objects while underperforming on small objects.

Zero-shot instance segmentation. Using detected boxes from ViTDet as prompts, SAM produces masks that score within 2–4 mAP points of ViTDet's own mask head on COCO and LVIS, despite never training on these datasets.

Ablation highlights. The authors ablate encoder size (ViT-B/L/H), finding consistent gains with scale — ViT-H improves 2.4 mIoU over ViT-B on the 23-dataset zero-shot benchmark. Data scale also matters: training on 0.1× SA-1B data degrades single-point mIoU by ~2 points, confirming the data engine's value beyond the architecture itself.

Critical Analysis

Strengths.

  • Zero-shot generalization. SAM transfers to new domains (medical imaging, satellite imagery, underwater photos) without fine-tuning. This is a direct consequence of data scale and the promptable task formulation.
  • Amortized compute via decoupled architecture. The heavy ViT-H encoder runs once; the mask decoder runs in ~50ms. This enables interactive annotation tools where a user issues dozens of prompts on the same image.
  • Data engine as a contribution. The three-stage annotation pipeline is a reusable methodology. It demonstrates that model-assisted annotation can bootstrap a billion-scale dataset with quality comparable to manual labeling.

Limitations.

  • No semantic labels. SAM produces class-agnostic masks. It can segment an object but cannot tell you what it is. Downstream applications must combine SAM with a classifier or use an extension like Grounded-SAM.
  • Struggles with thin structures. Fine-grained boundaries (bicycle spokes, fences, hair) are systematically underrepresented in training data and difficult for the 64×64 bottleneck to resolve. Mask quality degrades on these cases.
  • Requires prompts at inference. Unlike fully automatic panoptic segmentation models, SAM does not segment an image without at least one prompt. The automatic mode (grid of points + filtering) is a workaround, not a principled solution.
  • SA-1B distribution gaps. Despite its size, SA-1B is heavily biased toward everyday photography. Performance drops on specialized domains (medical histology, aerial imagery) where the visual statistics differ from the training distribution.
  • No temporal reasoning. SAM operates on single images. Video segmentation, where temporal coherence matters, is out of scope for this version.

SAM 2 and Follow-Up Work

SAM 2 (Ravi et al., 2024) extends SAM to video by adding a memory mechanism: a memory encoder stores per-frame features, a memory bank maintains a fixed-size set of past frame representations, and cross-attention over the memory bank enables temporal consistency. SAM 2 uses Hiera (a hierarchical ViT) as the image encoder instead of ViT-H, improving efficiency. It was trained on SA-V, a new video dataset with 35.5M masks across 50.9k videos.
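
A hedged sketch of the memory-bank idea, with illustrative shapes and a single-head attention stub rather than SAM 2's actual modules:

```python
from collections import deque
import numpy as np

# Fixed-size FIFO memory bank: keep the N most recent frame features and
# let the current frame cross-attend over them. Capacity and shapes are
# illustrative, not SAM 2's real configuration.
class MemoryBank:
    def __init__(self, capacity=6, dim=256):
        self.frames = deque(maxlen=capacity)  # oldest entries evicted first
        self.dim = dim

    def add(self, frame_feats):
        self.frames.append(frame_feats)

    def attend(self, query):
        # Cross-attention of current-frame features over stored memories.
        if not self.frames:
            return query
        mem = np.concatenate(list(self.frames))        # (T * N, dim)
        scores = query @ mem.T / np.sqrt(self.dim)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return query + w @ mem

rng = np.random.default_rng(0)
bank = MemoryBank(capacity=6)
for _ in range(10):                        # only the last 6 frames are kept
    bank.add(rng.normal(size=(16, 256)))
out = bank.attend(rng.normal(size=(16, 256)))
print(len(bank.frames), out.shape)         # 6 (16, 256)
```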

Grounded-SAM combines SAM with Grounding DINO to add open-vocabulary semantic labels — Grounding DINO detects boxes from text queries, SAM segments within those boxes. This two-stage composition became the de facto open-vocabulary segmentation pipeline before end-to-end alternatives emerged.

Efficiency variants. EfficientSAM and MobileSAM distill the ViT-H encoder into smaller backbones (ViT-Tiny, ViT-Small) for edge deployment, reducing the encoder from 632M to ~5–25M parameters with modest quality loss. FastSAM replaces the entire pipeline with a single-stage YOLOv8-based model that runs 50× faster.

Quality improvements. HQ-SAM adds a high-quality output token and a global-local feature fusion module to improve boundary quality on thin structures — directly addressing one of SAM's main failure modes. Semantic-SAM extends the multi-granularity output to six levels with semantic awareness.

Impact and Legacy

SAM reframed segmentation from a task-specific supervised problem into a foundation model problem. Before SAM, segmentation models were trained per-dataset (COCO, ADE20K, Cityscapes) with task-specific heads (Mask R-CNN, DeepLab). SAM showed that a single model trained at scale could match or approach these specialists in zero-shot, collapsing the need for per-domain annotation.

The practical impact was immediate: SAM became the default backbone for annotation tools (Label Studio, Roboflow, CVAT), enabling 10–20x speedups in mask labeling. It also became a building block for composed systems — Grounded-SAM for open-vocabulary segmentation, SAM + ControlNet for image editing, SAM + depth estimation for 3D reconstruction.

The deeper lesson is that scale in data (1.1B masks) and a well-chosen task formulation (promptable segmentation) can substitute for architectural novelty. SAM's architecture is not particularly novel — ViT encoder, transformer decoder, focal + dice loss are all standard components. The contribution is in the system design: the data engine, the task definition, and the engineering to make it work at scale.

If you found this paper review helpful, consider sharing it with others.
