SAM 3: A unified foundation model for promptable segmentation in images/videos (github.com)

🤖 AI Summary
Meta has announced SAM 3, a unified foundation model for promptable segmentation in images and videos. It extends its predecessor, SAM 2, by segmenting all instances of an open-vocabulary concept specified by a short text phrase or a visual exemplar, rather than one object per geometric prompt. On the new SA-Co benchmark, which covers 270,000 unique concepts (more than 50 times the size of previous benchmarks), SAM 3 reaches 75-80% of human performance.

Training is supported by a data engine that has automatically annotated over 4 million concepts, which Meta describes as the largest high-quality open-vocabulary segmentation dataset to date. Architecturally, SAM 3 introduces a presence token that helps the model distinguish between similar text prompts, alongside a decoupled detector-tracker design that reduces interference between the two tasks. The 848M-parameter model pairs a DETR-based detector with a transformer encoder-decoder tracker, and it remains fast enough for interactive segmentation, making it a useful base for researchers and developers working on open-vocabulary visual recognition.
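To make the concept-prompting workflow concrete, here is a minimal sketch of text-prompted segmentation. The module path `sam3`, the `build_sam3` / `Sam3Predictor` entry points, and the `presence_score` / `masks` / `scores` fields are all illustrative assumptions modeled on the predictor interfaces of the earlier SAM and SAM 2 repos, not the confirmed SAM 3 API; consult the repository for the actual entry points.

```python
# Hypothetical usage sketch -- names below are assumptions, not the real API.
import numpy as np
from PIL import Image

from sam3 import build_sam3, Sam3Predictor  # assumed module and classes

# Load the 848M-parameter model from a checkpoint (assumed loader signature).
model = build_sam3(checkpoint="sam3.pt")
predictor = Sam3Predictor(model)

image = np.array(Image.open("kitchen.jpg").convert("RGB"))
predictor.set_image(image)

# Open-vocabulary concept prompt: unlike SAM 1/2's point/box prompts,
# this asks for *every* instance matching the phrase.
result = predictor.predict(text="striped coffee mug")

# The presence token described in the summary suggests a per-concept
# presence score, so an absent concept yields no spurious instance masks.
if result.presence_score > 0.5:  # assumed field and threshold
    for mask, score in zip(result.masks, result.scores):
        # Assuming masks are boolean numpy arrays, sum() counts pixels.
        print(f"instance score={score:.2f}, area={mask.sum()} px")
else:
    print("concept not present in image")
```

The gate on `presence_score` mirrors the summary's presence-token idea: deciding whether the concept appears at all is scored separately from localizing each instance.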