TL;DR
- SAM 2 generalizes SAM’s promptable segmentation from single images to video: a single prompt on one frame propagates the mask across all subsequent frames without additional user input.
- A streaming memory bank stores spatial features and object pointers from past frames; a memory attention module lets the current frame cross-attend to these stored memories, providing temporal context analogous to a KV cache over time.
- An occlusion head predicts whether the object is visible in each frame, preventing mask collapse when the object disappears and enabling re-detection when it reappears.
- SAM 2 unifies image and video segmentation in a single real-time model, trained on the SA-V dataset (50.9k videos, 35.5M masks), and achieves strong zero-shot video object segmentation across multiple benchmarks.
From images to video
SAM defined promptable image segmentation: given a point, box, or mask prompt, produce a valid segmentation mask for any object. The model is remarkably general, but it has no notion of time — each image is processed in isolation.
SAM 2 extends this to video. The setup is streaming: frames arrive one at a time, in order. The user provides one or more prompts on one (or a few) frames, and SAM 2 propagates the corresponding mask forward (and backward) across the entire clip. For image segmentation, SAM 2 simply runs the same pipeline on a single-frame clip, so images and videos share a single unified model.
The architectural change that makes this possible is a streaming memory bank paired with a memory attention module. Instead of discarding per-frame computation after producing a mask, SAM 2 stores a compact representation of each processed frame. When processing the next frame, it retrieves these stored memories and uses cross-attention to condition the current frame’s features on the object’s history. This gives the model a persistent, temporally coherent view of the object across the clip.
The model was trained on SA-V, a new video dataset built with a model-in-the-loop data engine similar to the one used for SA-1B. SA-V contains 50.9k videos and 35.5 million masks, which made it by far the largest annotated video segmentation dataset at release. The backbone is Hiera, a hierarchical vision transformer that is more efficient than the ViT-H used in the original SAM.
Streaming memory bank
The memory bank stores representations from two sources: spatial memories of a fixed number of recent frames processed by the model (the SAM 2 paper also keeps memories of the prompted frames), and a set of object pointers, which are lightweight embedding vectors derived from the mask decoder’s output tokens. Object pointers summarize what the object looked like at each remembered frame without storing a full spatial feature map, keeping memory overhead low.
As frames are processed sequentially, the memory bank is updated in a sliding-window fashion: the oldest frame entry is evicted when the bank is full, while object pointers are retained for a longer horizon. This gives the model a sense of recent appearance (spatial features) plus a compressed history of the object’s identity (object pointers).
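The sliding-window behavior described above can be sketched with two bounded queues. This is a minimal illustration, not SAM 2's actual code: the `MemoryBank` class, its method names, and the window sizes are all hypothetical, and the stored "features" are placeholder strings.

```python
from collections import deque


class MemoryBank:
    """Toy streaming memory bank (hypothetical names and sizes)."""

    def __init__(self, max_frames=6, max_pointers=16):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.frames = deque(maxlen=max_frames)      # spatial memories, short horizon
        self.pointers = deque(maxlen=max_pointers)  # object pointers, longer horizon

    def add(self, frame_idx, spatial_memory, object_pointer):
        self.frames.append((frame_idx, spatial_memory))
        self.pointers.append((frame_idx, object_pointer))


bank = MemoryBank(max_frames=3, max_pointers=5)
for t in range(8):
    bank.add(t, spatial_memory=f"feat_{t}", object_pointer=f"ptr_{t}")

frame_ids = [idx for idx, _ in bank.frames]
pointer_ids = [idx for idx, _ in bank.pointers]
print(frame_ids)    # [5, 6, 7]
print(pointer_ids)  # [3, 4, 5, 6, 7]
```

After eight frames, only the last three spatial memories remain, while the cheaper object pointers reach further back.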
For each new frame, the memory attention module (described below) first conditions the frame’s image features on the stored bank, and the mask decoder then predicts the mask from those conditioned features. Afterwards, the memory encoder compresses the frame’s spatial features together with the predicted mask into a compact memory that is added to the bank. During forward streaming the bank holds only frames that have already been processed, so the system can run in real time without waiting to see future frames.
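A toy version of this per-frame loop makes the ordering and the causality explicit. Every stage below is a stand-in that only builds a labeled tuple; the function name `run_clip` and the window size are assumptions for illustration, not part of SAM 2.

```python
from collections import deque


def run_clip(frames, window=3):
    """Toy streaming loop: records which past frames each frame can see."""
    bank = deque(maxlen=window)  # memories of already-processed frames only
    log = []
    for t, frame in enumerate(frames):
        features = ("features", frame)                # image encoder stand-in
        seen = [idx for idx, _ in bank]               # memories available right now
        conditioned = ("conditioned", features, tuple(seen))  # memory attention stand-in
        mask = ("mask", t, conditioned)               # mask decoder stand-in
        memory = ("memory", features, mask)           # memory encoder stand-in
        bank.append((t, memory))                      # stored only after prediction
        log.append((t, seen))
    return log


log = run_clip(["f0", "f1", "f2", "f3", "f4"], window=3)
for t, seen in log:
    print(t, seen)
```

Each frame sees only earlier frames, and once the window is full the oldest one drops out of view.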
Memory attention
Memory attention is the mechanism that connects the current frame to its stored history. It is a form of cross-attention: the current frame’s image features act as the queries, and the stored memory entries (spatial feature maps from past frames plus object-pointer embeddings) supply the keys and values.
This is conceptually similar to how a KV cache works in autoregressive language models. In a language model, past token representations are cached as keys and values so that each new token can attend to the full context without recomputing it. In SAM 2, past frame representations are cached in the memory bank so that the current frame can attend to the object’s history without reprocessing earlier frames. The analogy is not exact: a KV cache serves causal self-attention and grows with every token, whereas SAM 2’s memory bank is read through cross-attention and is capped at a fixed size. What the two share is the reuse of stored keys and values from the past.
Conditioning on more memory entries gives the model a richer view of the object’s appearance over time, which helps when the object changes scale, pose, or illumination across frames. In practice the model attends to a fixed-size window of the most recent frames plus all retained object pointers, balancing context richness against compute cost.
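The core operation can be sketched as single-head scaled dot-product cross-attention. This is a deliberately stripped-down illustration in plain Python: it omits the learned projections, positional encodings, and stacked layers a real model would use, and the helper names and toy vectors are assumptions.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Each query (current-frame token) takes a weighted average of memory values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two current-frame tokens (queries) and three memory tokens (keys/values).
current_tokens = [[1.0, 0.0], [0.0, 1.0]]
memory_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

conditioned = cross_attend(current_tokens, memory_keys, memory_values)
print(conditioned)
```

Each output row is a convex combination of the memory values, weighted toward the memory entries whose keys best match the query.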
Surviving occlusion
Video object segmentation has a hard failure mode: occlusion. When another object passes in front of the target — or the target moves behind a wall, dips below a waterline, or exits the frame entirely — there is no mask to produce. A naive model might hallucinate a mask on whatever is nearby, or collapse to an empty prediction that never recovers.
SAM 2 addresses this with a dedicated occlusion head: a lightweight binary classifier that runs on the mask decoder’s output and predicts whether the object is present in the current frame. When the occlusion head predicts absence, SAM 2 suppresses the mask output entirely rather than producing a spurious prediction. Crucially, the memory bank still records the frame even when the object is occluded, so the model retains a continuous record of the clip.
When the object reappears, the occlusion head switches back to predicting presence, and the memory bank provides the stored appearance information the model needs to re-acquire the object. The recovered mask picks up from the stored representations rather than starting from scratch, which is why re-detection after long occlusions is qualitatively better than running SAM frame-by-frame.
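The gating behavior can be sketched as follows. This is an illustrative toy, not SAM 2's implementation: `gate_mask`, the zero threshold, and all logit values are assumptions chosen to show a visible, occluded, then visible sequence.

```python
def gate_mask(mask_logits, presence_logit, threshold=0.0):
    """Binarize mask logits, or return an empty mask if the object is judged absent."""
    if presence_logit <= threshold:
        return [[0] * len(row) for row in mask_logits]
    return [[1 if value > 0 else 0 for value in row] for row in mask_logits]


# Three toy frames as (mask_logits, presence_logit); all numbers are made up.
clip = [
    ([[2.0, -1.0], [3.0, -2.0]], 4.0),   # object visible
    ([[0.5, 0.4], [-0.1, 0.2]], -3.0),   # occluded: spurious positive logits
    ([[-1.0, 2.5], [-2.0, 1.5]], 2.0),   # object visible again
]

history = []  # every frame is recorded, including the occluded one
for frame_idx, (mask_logits, presence_logit) in enumerate(clip):
    mask = gate_mask(mask_logits, presence_logit)
    history.append((frame_idx, presence_logit > 0.0, mask))

for frame_idx, visible, mask in history:
    print(frame_idx, visible, mask)
```

On the occluded frame the positive mask logits are discarded rather than turned into a spurious mask, yet the frame still appears in the recorded history.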
Why it mattered
SAM 2 was the first model to offer promptable segmentation for both images and video in a single architecture, running at interactive speeds (around 44 frames per second on an A100). Many earlier video object segmentation methods relied on test-time fine-tuning for each target object or were trained for narrower settings; SAM 2 works zero-shot across surveillance, sports, medical, and natural-video footage.
The streaming memory design is practical for deployment: memory cost scales with the number of stored frames, not with clip length, because the window is fixed. This makes the model usable on long videos, since neither the memory footprint nor the per-frame attention cost grows as the clip gets longer.
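A quick back-of-the-envelope comparison shows why the fixed window matters. The figures here are assumptions for illustration only (4096 memory tokens per stored frame and a window of 6 frames), and `attended_tokens` is a hypothetical helper.

```python
def attended_tokens(frame_idx, tokens_per_frame, window=None):
    """Memory tokens visible to frame `frame_idx` (counting earlier frames only)."""
    past_frames = frame_idx if window is None else min(frame_idx, window)
    return past_frames * tokens_per_frame


TOKENS_PER_FRAME = 4096  # assumed, e.g. a 64x64 feature map
WINDOW = 6               # assumed window size

for frame_idx in (3, 100, 1000):
    bounded = attended_tokens(frame_idx, TOKENS_PER_FRAME, WINDOW)
    unbounded = attended_tokens(frame_idx, TOKENS_PER_FRAME)
    print(frame_idx, bounded, unbounded)
```

With the window, the count plateaus at 6 × 4096 = 24,576 tokens; without it, frame 1000 would attend to 4,096,000 tokens.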
The SA-V dataset released alongside the model has become a standard benchmark for video segmentation research, and the open-sourced model weights have been integrated into annotation tools, robotics pipelines, and video editing workflows in much the same way that original SAM reshaped image annotation.
Related Reading
- Segment Anything (SAM) — the image-only predecessor whose promptable segmentation task and mask decoder architecture SAM 2 directly extends
- DETR — transformer-based detection that introduced learned object queries, an idea related to SAM 2’s object-pointer embeddings
- CLIP — vision-language alignment that established the foundation-model paradigm for visual understanding that both SAM and SAM 2 follow
