i'm learning about SAM 3 (Segment Anything Model 3) from Meta AI. i built a text-prompted image segmentation app using it, and wanted to understand how it actually works under the hood.
what is sam3?
sam3 is a model that detects, segments, and tracks objects in images and videos based on concept prompts. unlike traditional segmentation models that work with predefined categories, sam3 lets you specify what you want using:
- simple noun phrases - "red apple", "striped cat", "person with hat"
- image exemplars - bounding boxes around example objects (positive or negative)
- combination of both - text + image examples for refinement
the key constraint: text prompts must be simple noun phrases, not complex queries. for more complex language prompts, sam3 can be combined with a multimodal large language model (MLLM).
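to make the prompt types concrete, here's how i mentally model a concept prompt. this is just a sketch of the inputs described above, not the real sam3 API — `ConceptPrompt` and its fields are my own names:

```python
from dataclasses import dataclass, field

# hypothetical container for a concept prompt -- not the real sam3 API,
# just the inputs described above bundled together
@dataclass
class ConceptPrompt:
    noun_phrase: str | None = None                # simple noun phrase, e.g. "red apple"
    positive_boxes: list[tuple[float, float, float, float]] = field(default_factory=list)
    negative_boxes: list[tuple[float, float, float, float]] = field(default_factory=list)

prompt = ConceptPrompt(
    noun_phrase="striped cat",
    positive_boxes=[(0.10, 0.20, 0.45, 0.60)],    # normalized xyxy box around an example cat
    negative_boxes=[(0.55, 0.15, 0.90, 0.50)],    # box around something to exclude
)
```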
promptable concept segmentation
sam3 introduces promptable concept segmentation (PCS) - a new task that takes text and/or image exemplars as input and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames.
this is different from sam 1 and 2's promptable visual segmentation (PVS), which segments only one object per prompt using points, boxes, or masks.
model architecture
sam3 uses a dual encoder-decoder transformer architecture with two main components: a detector and a tracker.
perception encoder
both the detector and tracker share a perception encoder (PE) - a vision-language backbone that processes:
- image input
- text prompt
- image exemplars (if provided)
image exemplars and text tokens become "prompt tokens" that condition the model.
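a tiny sketch of what those prompt tokens could look like in tensor terms — the dimensions and token counts are invented, but the idea is just concatenating encoded text tokens and exemplar tokens into one conditioning sequence:

```python
import torch

# toy illustration: prompt tokens = encoded noun phrase tokens + one token per
# exemplar box (dimensions here are made up, not sam3's real sizes)
d_model = 256
text_tokens = torch.randn(1, 8, d_model)          # encoded "striped cat", say 8 tokens
exemplar_tokens = torch.randn(1, 2, d_model)      # one positive + one negative exemplar
prompt_tokens = torch.cat([text_tokens, exemplar_tokens], dim=1)  # (1, 10, d_model)
```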
detector
the detector handles user inputs like noun phrases or image exemplars. it follows a DETR-like architecture:
- fusion encoder - accepts unconditioned embeddings from image encoder and conditions them by cross-attending to the prompt tokens
- decoder - learned object queries cross-attend to conditioned image embeddings from fusion encoder
- output heads - predict bounding boxes and segmentation masks
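here's a rough pytorch sketch of that detector path. the layer counts, dimensions, and head designs are my assumptions (the mask head here is a simple dot-product head), so treat it as a mental model rather than sam3's actual implementation:

```python
import torch
import torch.nn as nn

# minimal sketch of the DETR-like detector described above; layer sizes,
# module choices, and head shapes are guesses, not sam3's real config
class DetectorSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=200, num_heads=8):
        super().__init__()
        # fusion encoder: image tokens cross-attend to the prompt tokens
        self.fusion_cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # decoder: learned object queries cross-attend to conditioned image tokens
        self.object_queries = nn.Parameter(torch.randn(num_queries, d_model))
        decoder_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        # output heads: boxes plus per-query mask embeddings
        self.box_head = nn.Linear(d_model, 4)
        self.mask_embed_head = nn.Linear(d_model, d_model)

    def forward(self, image_tokens, prompt_tokens, mask_features):
        # condition the unconditioned image embeddings on the prompt (text + exemplars)
        conditioned, _ = self.fusion_cross_attn(image_tokens, prompt_tokens, prompt_tokens)
        # decode object queries against the conditioned image embeddings
        queries = self.object_queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        decoded = self.decoder(queries, conditioned)
        boxes = self.box_head(decoded).sigmoid()                      # (B, Q, 4)
        # dot-product mask head: query embedding x per-pixel features
        masks = torch.einsum("bqc,bchw->bqhw", self.mask_embed_head(decoded), mask_features)
        return boxes, masks

# toy usage with made-up shapes
det = DetectorSketch()
image_tokens = torch.randn(1, 64 * 64, 256)       # flattened PE image features
prompt_tokens = torch.randn(1, 10, 256)           # text + exemplar tokens
mask_features = torch.randn(1, 256, 64, 64)       # per-pixel features for the mask head
boxes, masks = det(image_tokens, prompt_tokens, mask_features)  # (1, 200, 4), (1, 200, 64, 64)
```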
presence token
a key innovation: a learned presence token that decouples recognition from localization.
the presence token predicts: p(noun phrase is present in image)
each proposal query only needs to solve: p(query matches | noun phrase is present)
final score = proposal query score × presence score
this helps reduce false positives, especially with challenging negative examples.
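in code, the scoring decomposition looks something like this (the logits and values are just for illustration):

```python
import torch

# sketch of the decomposition above: each proposal query answers
# "does this query match, given the concept is present?", while the single
# presence token answers "is the concept present in the image at all?"
def final_scores(query_logits: torch.Tensor, presence_logit: torch.Tensor) -> torch.Tensor:
    p_match_given_present = query_logits.sigmoid()   # per-query scores
    p_present = presence_logit.sigmoid()             # scalar presence score
    return p_match_given_present * p_present         # final score = query score x presence score

query_logits = torch.tensor([2.0, -1.0, 0.5])
presence_logit = torch.tensor(-3.0)   # model is fairly sure the phrase is absent
print(final_scores(query_logits, presence_logit))    # all scores pushed toward 0
```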
tracker
the tracker is based on SAM 2's memory-based module. it ensures temporal consistency from frame to frame using:
memory bank - encodes object appearance using features from:
- past frames
- frames where object was first detected
- user-prompted frames
memory encoder - transformer with:
- self-attention across visual features of the current frame
- cross-attention from visual features to the spatial memory in the memory bank
during inference, only frames where the object is confidently present are retained in memory bank.
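a minimal sketch of that retention logic, assuming a per-object memory that keeps anchor frames (first detection + user-prompted frames) plus a rolling window of recent confident frames — the sizes and threshold are my guesses:

```python
from collections import deque

# toy memory bank for one tracked object: anchor frames are always kept,
# recent frames are only retained when the object is confidently present
class ObjectMemory:
    def __init__(self, max_recent: int = 6, presence_threshold: float = 0.5):
        self.anchor_frames = []                       # first-detected + user-prompted frames
        self.recent_frames = deque(maxlen=max_recent) # rolling window of past frames
        self.presence_threshold = presence_threshold

    def add_anchor(self, frame_features):
        self.anchor_frames.append(frame_features)

    def maybe_add(self, frame_features, presence_score: float):
        # only retain frames where the object is confidently present
        if presence_score >= self.presence_threshold:
            self.recent_frames.append(frame_features)

    def memory(self):
        # what the memory encoder would cross-attend to for the current frame
        return self.anchor_frames + list(self.recent_frames)
```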
video processing
when a video and prompt are provided:
- initialization - detector finds all objects on first frame, creates a masklet for each
- each subsequent frame:
  - detector finds new objects: O_t = detect(I_t, P)
  - tracker propagates masklets from previous frame: M̂_t = propagate(M_{t-1})
  - matching function merges results: M_t = match_and_update(M̂_t, O_t)
a masklet is a spatio-temporal mask - tracking an object across multiple frames.
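the loop above, written out as plain python pseudocode — `detect`, `propagate`, and `match_and_update` are placeholders standing in for the detector, tracker, and matching step, not real sam3 functions:

```python
# per-frame video processing loop; all three callables are placeholders
def segment_video(frames, prompt, detect, propagate, match_and_update):
    # initialization: detect on the first frame and start one masklet per object
    masklets = {obj_id: [mask] for obj_id, mask in enumerate(detect(frames[0], prompt))}

    for frame in frames[1:]:
        detections = detect(frame, prompt)                    # O_t = detect(I_t, P)
        propagated = propagate(masklets, frame)               # M̂_t = propagate(M_{t-1})
        masklets = match_and_update(propagated, detections)   # M_t
    return masklets   # each masklet spans frames -> a spatio-temporal mask
```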
disambiguation strategies
two strategies resolve ambiguities in crowded scenes:
- masklet detection score - measures how consistently a masklet matches detections within a temporal window. suppress if score < threshold.
- detector-based correction - periodically re-prompt tracker with high-confidence detection masks to resolve occlusions/distractors. ensures memory bank has recent and reliable references.
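the first strategy is easy to sketch: track whether each masklet got matched to a detection on recent frames, and suppress it when the match ratio within the window falls below a threshold (the window size and threshold below are made up):

```python
# sketch of the masklet detection score: fraction of recent frames where the
# masklet was matched to a detection within a sliding temporal window
def masklet_detection_score(matched_flags: list[bool], window: int = 8) -> float:
    recent = matched_flags[-window:]
    return sum(recent) / len(recent) if recent else 0.0

def should_suppress(matched_flags: list[bool], threshold: float = 0.4) -> bool:
    return masklet_detection_score(matched_flags) < threshold

print(should_suppress([True, True, False, False, False, False, True, False]))  # True (3/8 < 0.4)
```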
handling ambiguity
the model predicts three output masks for every tracked object on each frame along with their confidence scores, then selects the most confident output.
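selection is then just an argmax over the per-candidate confidences, e.g.:

```python
import torch

# pick the most confident of the three candidate masks for one object on one frame
# (shapes are illustrative only)
mask_logits = torch.randn(3, 256, 256)           # three candidate masks
confidences = torch.tensor([0.31, 0.72, 0.55])   # one confidence per candidate
best_mask = mask_logits[confidences.argmax()]    # keep the most confident output
```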
data engine
achieving strong performance required a massive training dataset. meta built a human- and model-in-the-loop data engine with four phases.
key components
- media pool - images and videos from web-scraped data and specialized domains
- ontology - hierarchical set of entities (e.g., animals → mammals, reptiles)
- noun phrase proposals - AI models generate concept descriptions
- mask proposals - models (eventually sam3 itself) generate candidate masks
- verification - humans and AI models verify mask quality and exhaustiveness
phase 1: human verification
- randomly sample images and noun phrase proposals
- initial mask proposals from SAM 2 + open-vocabulary detector
- humans verify mask quality and exhaustiveness
phase 2: human + AI verification
key innovation: AI verifiers - fine-tuned Llama 3.2 models that perform verification tasks:
mask verification (MV) - accept/reject masks based on quality and relevance
exhaustivity verification (EV) - check if all instances are masked
this roughly doubled throughput by focusing humans on challenging cases.
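my rough mental model of that routing — auto-resolve confident AI verdicts and send only the uncertain band to human annotators; the thresholds and interface here are assumptions, not meta's actual pipeline:

```python
# hypothetical gate in front of human review: confident AI-verifier calls are
# auto-resolved, everything else goes to a human annotator
def route_mask(ai_verdict: str, ai_confidence: float,
               accept_above: float = 0.9, reject_above: float = 0.9) -> str:
    if ai_verdict == "accept" and ai_confidence >= accept_above:
        return "auto-accept"
    if ai_verdict == "reject" and ai_confidence >= reject_above:
        return "auto-reject"
    return "human-review"   # only challenging cases reach humans

print(route_mask("accept", 0.97))   # auto-accept
print(route_mask("accept", 0.62))   # human-review
```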
this phase also introduced hard negative noun phrases that are adversarial to sam3.
phase 3: scaling and domain expansion
- expanded to 15 visual domains
- mined long-tail concepts from a 22.4M-node ontology based on wikidata
- iterated sam3 training 7 times, AI verifiers 3 times
phase 4: video annotation
extended the data engine to video using the mature image-level sam3:
- applied scene/motion filters and content balancing
- sampled frames and sent to image annotation pipeline
- generated masklets with video sam3
- post-processed via deduplication and removal of trivial masks