i'm learning about SAM 3 (Segment Anything Model 3) from Meta AI. i built a text-prompted image segmentation app using it, and wanted to understand how it actually works under the hood.
what is sam3?
sam3 is a model that detects, segments, and tracks objects in images and videos based on concept prompts. unlike traditional segmentation models that work with predefined categories, sam3 lets you specify what you want using:
- simple noun phrases - "red apple", "striped cat", "person with hat"
- image exemplars - bounding boxes around example objects (positive or negative)
- combination of both - text + image examples for refinement
the key constraint: text prompts must be simple noun phrases, not complex queries. for more complex language prompts, sam3 can be combined with a multimodal large language model (MLLM).
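to make the prompt types concrete, here's how i mentally model a concept prompt. this is just a sketch of the inputs described above, not the real sam3 API — `ConceptPrompt` and its fields are my own names:

```python
from dataclasses import dataclass, field

# hypothetical container for a concept prompt -- not the real sam3 API,
# just the inputs described above bundled together
@dataclass
class ConceptPrompt:
    noun_phrase: str | None = None                # simple noun phrase, e.g. "red apple"
    positive_boxes: list[tuple[float, float, float, float]] = field(default_factory=list)
    negative_boxes: list[tuple[float, float, float, float]] = field(default_factory=list)

prompt = ConceptPrompt(
    noun_phrase="striped cat",
    positive_boxes=[(0.10, 0.20, 0.45, 0.60)],    # normalized xyxy box around an example cat
    negative_boxes=[(0.55, 0.15, 0.90, 0.50)],    # box around something to exclude
)
```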
promptable concept segmentation
sam3 introduces promptable concept segmentation (PCS) - a new task that takes text and/or image exemplars as input and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames.
this is different from sam 1 and 2's promptable visual segmentation (PVS), which segments only one object per prompt using points, boxes, or masks.
model architecture
sam3 uses a dual encoder-decoder transformer architecture with two main components: a detector and a tracker.
perception encoder
both the detector and tracker share a perception encoder (PE) - a vision-language backbone that processes:
- image input
- text prompt
- image exemplars (if provided)
image exemplars and text tokens become "prompt tokens" that condition the model.
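a tiny sketch of what those prompt tokens could look like in tensor terms — the dimensions and token counts are invented, but the idea is just concatenating encoded text tokens and exemplar tokens into one conditioning sequence:

```python
import torch

# toy illustration: prompt tokens = encoded noun phrase tokens + one token per
# exemplar box (dimensions here are made up, not sam3's real sizes)
d_model = 256
text_tokens = torch.randn(1, 8, d_model)          # encoded "striped cat", say 8 tokens
exemplar_tokens = torch.randn(1, 2, d_model)      # one positive + one negative exemplar
prompt_tokens = torch.cat([text_tokens, exemplar_tokens], dim=1)  # (1, 10, d_model)
```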
detector
the detector handles user inputs like noun phrases or image exemplars. it follows a DETR-like architecture:
- fusion encoder - accepts unconditioned embeddings from image encoder and conditions them by cross-attending to the prompt tokens
- decoder - learned object queries cross-attend to conditioned image embeddings from fusion encoder
- output heads - predict bounding boxes and segmentation masks
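here's a rough pytorch sketch of that detector path. the layer counts, dimensions, and head designs are my assumptions (the mask head here is a simple dot-product head), so treat it as a mental model rather than sam3's actual implementation:

```python
import torch
import torch.nn as nn

# minimal sketch of the DETR-like detector described above; layer sizes,
# module choices, and head shapes are guesses, not sam3's real config
class DetectorSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=200, num_heads=8):
        super().__init__()
        # fusion encoder: image tokens cross-attend to the prompt tokens
        self.fusion_cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # decoder: learned object queries cross-attend to conditioned image tokens
        self.object_queries = nn.Parameter(torch.randn(num_queries, d_model))
        decoder_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        # output heads: boxes plus per-query mask embeddings
        self.box_head = nn.Linear(d_model, 4)
        self.mask_embed_head = nn.Linear(d_model, d_model)

    def forward(self, image_tokens, prompt_tokens, mask_features):
        # condition the unconditioned image embeddings on the prompt (text + exemplars)
        conditioned, _ = self.fusion_cross_attn(image_tokens, prompt_tokens, prompt_tokens)
        # decode object queries against the conditioned image embeddings
        queries = self.object_queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        decoded = self.decoder(queries, conditioned)
        boxes = self.box_head(decoded).sigmoid()                      # (B, Q, 4)
        # dot-product mask head: query embedding x per-pixel features
        masks = torch.einsum("bqc,bchw->bqhw", self.mask_embed_head(decoded), mask_features)
        return boxes, masks

# toy usage with made-up shapes
det = DetectorSketch()
image_tokens = torch.randn(1, 64 * 64, 256)       # flattened PE image features
prompt_tokens = torch.randn(1, 10, 256)           # text + exemplar tokens
mask_features = torch.randn(1, 256, 64, 64)       # per-pixel features for the mask head
boxes, masks = det(image_tokens, prompt_tokens, mask_features)  # (1, 200, 4), (1, 200, 64, 64)
```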
presence token
a key innovation: a learned presence token that decouples recognition from localization.
the presence token predicts: p(noun phrase is present in image)
each proposal query only needs to solve: p(query matches | noun phrase is present)
final score = proposal query score × presence score
this helps reduce false positives, especially with challenging negative examples.
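in code, the scoring decomposition looks something like this (the logits and values are just for illustration):

```python
import torch

# sketch of the decomposition above: each proposal query answers
# "does this query match, given the concept is present?", while the single
# presence token answers "is the concept present in the image at all?"
def final_scores(query_logits: torch.Tensor, presence_logit: torch.Tensor) -> torch.Tensor:
    p_match_given_present = query_logits.sigmoid()   # per-query scores
    p_present = presence_logit.sigmoid()             # scalar presence score
    return p_match_given_present * p_present         # final score = query score x presence score

query_logits = torch.tensor([2.0, -1.0, 0.5])
presence_logit = torch.tensor(-3.0)   # model is fairly sure the phrase is absent
print(final_scores(query_logits, presence_logit))    # all scores pushed toward 0
```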
tracker
the tracker is based on SAM 2's memory-based module. it ensures temporal consistency from frame to frame using:
memory bank - encodes object appearance using features from:
- past frames
- frames where object was first detected
- user-prompted frames
memory encoder - transformer with:
- self-attention across visual features of the current frame
- cross-attention from visual features to the spatial memory in the memory bank
during inference, only frames where the object is confidently present are retained in memory bank.
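a minimal sketch of that retention logic, assuming a per-object memory that keeps anchor frames (first detection + user-prompted frames) plus a rolling window of recent confident frames — the sizes and threshold are my guesses:

```python
from collections import deque

# toy memory bank for one tracked object: anchor frames are always kept,
# recent frames are only retained when the object is confidently present
class ObjectMemory:
    def __init__(self, max_recent: int = 6, presence_threshold: float = 0.5):
        self.anchor_frames = []                       # first-detected + user-prompted frames
        self.recent_frames = deque(maxlen=max_recent) # rolling window of past frames
        self.presence_threshold = presence_threshold

    def add_anchor(self, frame_features):
        self.anchor_frames.append(frame_features)

    def maybe_add(self, frame_features, presence_score: float):
        # only retain frames where the object is confidently present
        if presence_score >= self.presence_threshold:
            self.recent_frames.append(frame_features)

    def memory(self):
        # what the memory encoder would cross-attend to for the current frame
        return self.anchor_frames + list(self.recent_frames)
```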
video processing
when a video and prompt are provided:
- initialization - detector finds all objects on first frame, creates a masklet for each
- each subsequent frame:
  - detector finds new objects: O_t = detect(I_t, P)
  - tracker propagates masklets from previous frame: M̂_t = propagate(M_{t-1})
  - matching function merges results: M_t = match_and_update(M̂_t, O_t)
a masklet is a spatio-temporal mask - tracking an object across multiple frames.
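the loop above, written out as plain python pseudocode — `detect`, `propagate`, and `match_and_update` are placeholders standing in for the detector, tracker, and matching step, not real sam3 functions:

```python
# per-frame video processing loop; all three callables are placeholders
def segment_video(frames, prompt, detect, propagate, match_and_update):
    # initialization: detect on the first frame and start one masklet per object
    masklets = {obj_id: [mask] for obj_id, mask in enumerate(detect(frames[0], prompt))}

    for frame in frames[1:]:
        detections = detect(frame, prompt)                    # O_t = detect(I_t, P)
        propagated = propagate(masklets, frame)               # M̂_t = propagate(M_{t-1})
        masklets = match_and_update(propagated, detections)   # M_t
    return masklets   # each masklet spans frames -> a spatio-temporal mask
```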
disambiguation strategies
two strategies resolve ambiguities in crowded scenes:
- masklet detection score - measures how consistently a masklet matches detections within a temporal window. suppress if score < threshold.
- detector-based correction - periodically re-prompt tracker with high-confidence detection masks to resolve occlusions/distractors. ensures memory bank has recent and reliable references.
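the first strategy is easy to sketch: track whether each masklet got matched to a detection on recent frames, and suppress it when the match ratio within the window falls below a threshold (the window size and threshold below are made up):

```python
# sketch of the masklet detection score: fraction of recent frames where the
# masklet was matched to a detection within a sliding temporal window
def masklet_detection_score(matched_flags: list[bool], window: int = 8) -> float:
    recent = matched_flags[-window:]
    return sum(recent) / len(recent) if recent else 0.0

def should_suppress(matched_flags: list[bool], threshold: float = 0.4) -> bool:
    return masklet_detection_score(matched_flags) < threshold

print(should_suppress([True, True, False, False, False, False, True, False]))  # True (3/8 < 0.4)
```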
handling ambiguity
the model predicts three output masks for every tracked object on each frame along with their confidence scores, then selects the most confident output.
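selection is then just an argmax over the per-candidate confidences, e.g.:

```python
import torch

# pick the most confident of the three candidate masks for one object on one frame
# (shapes are illustrative only)
mask_logits = torch.randn(3, 256, 256)           # three candidate masks
confidences = torch.tensor([0.31, 0.72, 0.55])   # one confidence per candidate
best_mask = mask_logits[confidences.argmax()]    # keep the most confident output
```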
data engine
achieving strong performance required a massive training dataset. meta built a human- and model-in-the-loop data engine with four phases.
key components
- media pool - images and videos from web-scraped data and specialized domains
- ontology - hierarchical set of entities (e.g., animals → mammals, reptiles)
- noun phrase proposals - AI models generate concept descriptions
- mask proposals - models (eventually sam3 itself) generate candidate masks
- verification - humans and AI models verify mask quality and exhaustiveness
phase 1: human verification
- randomly sample images and noun phrase proposals
- initial mask proposals from SAM 2 + open-vocabulary detector
- humans verify mask quality and exhaustiveness
phase 2: human + AI verification
key innovation: AI verifiers - fine-tuned Llama 3.2 models that perform verification tasks:
mask verification (MV) - accept/reject masks based on quality and relevance
exhaustivity verification (EV) - check if all instances are masked
this roughly doubled throughput by focusing humans on challenging cases.
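my rough mental model of that routing — auto-resolve confident AI verdicts and send only the uncertain band to human annotators; the thresholds and interface here are assumptions, not meta's actual pipeline:

```python
# hypothetical gate in front of human review: confident AI-verifier calls are
# auto-resolved, everything else goes to a human annotator
def route_mask(ai_verdict: str, ai_confidence: float,
               accept_above: float = 0.9, reject_above: float = 0.9) -> str:
    if ai_verdict == "accept" and ai_confidence >= accept_above:
        return "auto-accept"
    if ai_verdict == "reject" and ai_confidence >= reject_above:
        return "auto-reject"
    return "human-review"   # only challenging cases reach humans

print(route_mask("accept", 0.97))   # auto-accept
print(route_mask("accept", 0.62))   # human-review
```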
this phase also introduced hard negative noun phrases that are adversarial to sam3.
phase 3: scaling and domain expansion
- expanded to 15 visual domains
- mined long-tail concepts from a 22.4M-node ontology based on wikidata
- iterated sam3 training 7 times, AI verifiers 3 times
phase 4: video annotation
extended the data engine to video using the mature image-level sam3:
- applied scene/motion filters and content balancing
- sampled frames and sent to image annotation pipeline
- generated masklets with video sam3
- post-processed via deduplication and removal of trivial masks