SAM2 (Segment Anything 2) introduces a user-friendly human-computer interaction paradigm: users can generate full video mask trajectories from simple or complex prompts. This dramatically reduces video annotation labor from per-frame instance labeling to single-annotation tracking, significantly compressing the time cost of video data preparation. To achieve fully automated annotation, I think developing an automatic prompt generation system for SAM2 is worth a shot.
SAM2 supports mask generation using points/bounding boxes on specified frames, as well as video inference through point/box/mask prompts. Among these three approaches, I typically first obtain initial masks via SAM using point/box prompts, then feed these masks into SAM2 for temporal propagation. When comparing point vs box prompting, bounding boxes prove more automation-friendly: point-based methods inherently require human-curated positive/negative points, which demands interactive visualization. While box-generated masks may lack precision, automated box generation remains comparatively simpler. For users possessing substantial domain-specific segmentation data, training custom YOLO models for detection/segmentation is advisable. However, in most zero-shot application scenarios, existing general-purpose models are all I have to work with.
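One small but recurring piece of glue in this pipeline is box-format conversion: GroundingDINO-style detectors commonly emit normalized center-format boxes, while SAM/SAM2 box prompts expect absolute corner coordinates. A minimal sketch of that conversion, assuming this convention mismatch (the helper name is mine, not from any library):

```python
import numpy as np

def cxcywh_norm_to_xyxy(boxes: np.ndarray, width: int, height: int) -> np.ndarray:
    """Convert normalized (cx, cy, w, h) boxes to pixel-space (x1, y1, x2, y2).

    Many open-vocabulary detectors emit normalized center-format boxes,
    while SAM/SAM2 box prompts expect absolute corner coordinates.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return np.stack([x1, y1, x2, y2], axis=1)
```

The resulting array can then be passed, row by row, as box prompts to the mask predictor.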
Object Detection Models
Now let's discuss how to combine detection networks with SAM2 for video inference. Among detection models, I recommend GroundingDINO and Florence-2-large-ft. Notably, Florence-2 offers multi-functional capabilities including object detection, grounding, and caption+grounding. Through empirical testing, I found GroundingDINO outperforms Florence-2 on grounding tasks, so I use GroundingDINO for grounding and rely on Florence-2 primarily for plain object detection.
The critical distinction lies in GroundingDINO's requirement for text captions as prompts. In my experiments, performance is best when using single-keyword captions formatted as 'keyword1,keyword2,...' - this approach maintains detection stability while still capturing multiple instances. However, GroundingDINO is highly sensitive to hyperparameter tuning: an excessive box threshold increases missed detections, while an insufficient one boosts false positives. Parameter optimization must account for image quality and model generalization - in layman's terms, expect substantial trial-and-error.
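To make the threshold trade-off concrete, here is a toy score filter over (box, score) pairs; the function and the numbers are illustrative, not GroundingDINO's actual API:

```python
def filter_detections(boxes, scores, box_threshold):
    """Keep detections whose confidence is at least box_threshold.

    Raising the threshold suppresses false positives but drops weak true
    detections; lowering it recovers faint objects at the cost of noise.
    """
    return [(b, s) for b, s in zip(boxes, scores) if s >= box_threshold]

boxes = [[10, 10, 50, 50], [80, 20, 120, 60]]
scores = [0.62, 0.31]
strict = filter_detections(boxes, scores, 0.5)   # likely misses the faint object
loose = filter_detections(boxes, scores, 0.25)   # keeps both, risking noise
```

The right operating point depends on the footage, which is why the tuning loop above is unavoidable.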
SAM2 Integration
Having established frame-wise bounding box generation capabilities, a basic implementation involves acquiring mask prompts from the first frame and tracking instances throughout the video. Reference implementations include Florence2 + SAM2 and Grounded-SAM-2. These solutions suffice for tracking single persistent instances (e.g., camera-followed objects). However, dynamic scenes requiring detection of emerging objects demand more sophisticated approaches.
For comprehensive video instance tracking, I cannot rely solely on initial frame detections. My goal is full automation rather than assisted annotation, eliminating real-time constraints. Here's how I think about it:
- Require per-frame detection with novel instance identification
- Generate initial frame masks, then detect new instances in subsequent frames
- Determine novelty through mask overlap analysis between detections and existing instances
- Implement quality control to filter SAM's fragmented outputs (users familiar with SAM will recognize these artifacts)
- Address detection inconsistencies - new boxes might represent previously missed instances
- Enable temporal backward propagation since SAM2 lacks native backward inference
- Implement duplicate prevention through mask overlap checks and quality filtering
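Both the novelty check and the duplicate prevention above reduce to mask-overlap analysis. A minimal sketch with NumPy boolean masks, assuming an IoU criterion (the 0.5 threshold is a placeholder to tune, and the helper names are mine):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def is_novel(candidate: np.ndarray, existing: list, iou_threshold: float = 0.5) -> bool:
    """A detection is novel if it overlaps no tracked mask above the threshold."""
    return all(mask_iou(candidate, m) < iou_threshold for m in existing)
```

In practice the threshold interacts with the detection-inconsistency problem noted above: a box that overlaps an existing trajectory only partially may be a genuinely new instance or a previously missed one, so the cutoff deserves per-dataset tuning.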
Thus, the implementation logic unfolds as follows:
- Frame extraction from video
- Per-frame object detection
- Iterative processing from initial frame:
- Convert current frame bounding boxes to masks via SAM with post-processing
- Identify novel instances through mask overlap analysis
- Feed masks to SAM2 for video propagation
- Reinitialize the video predictor with reversed frame order and repropagate, to obtain backward propagation
- Deduplicate trajectories through overlap checks
- Apply post-processing and update master mask repository
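Since SAM2's video predictor only propagates forward, the backward step above amounts to running it again on the reversed frame sequence and remapping indices afterward. A sketch of that bookkeeping, assuming per-frame results keyed by frame index (function names are mine):

```python
def reversed_to_original_index(num_frames: int, reversed_idx: int) -> int:
    """Frame j of the reversed clip corresponds to frame N-1-j of the original."""
    return num_frames - 1 - reversed_idx

def merge_passes(num_frames, forward, backward_reversed):
    """Merge per-frame masks from a forward pass and a reversed-clip pass.

    `forward` and `backward_reversed` map frame index -> mask; the backward
    dict is keyed by reversed-clip indices and is remapped before merging.
    Forward results take precedence where both passes produced a mask.
    """
    merged = {reversed_to_original_index(num_frames, j): m
              for j, m in backward_reversed.items()}
    merged.update(forward)
    return merged
```

Letting the forward pass win on overlapping frames is one arbitrary but simple choice; averaging or IoU-based arbitration between the two passes is also possible.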
Here is the core implementation (using Florence 2 as an example):
```python
import gradio as gr
```
Example of Florence 2 + SAM 2:
Example of GroundingDINO + SAM 2:
To contact me, send an email to zihanliu@hotmail.com.