Autonomous Video Tracking & Segmentation

SAM2 (Segment Anything 2) introduces a user-friendly human-computer interaction paradigm: from a simple or complex prompt, it can generate a full mask trajectory for an object across a video. This dramatically reduces video annotation labor from per-frame instance labeling to single-annotation tracking, significantly compressing the time cost of video data preparation. To push this toward fully automated annotation, I think building an automatic prompt generation system for SAM2 is worth a shot.

SAM2 supports mask generation from points/bounding boxes on specified frames, as well as video inference from point/box/mask prompts. Among these three prompt types, I typically first obtain initial masks via SAM using point/box prompts, then feed these masks into SAM2 for temporal propagation. Comparing point and box prompting, bounding boxes prove more automation-friendly: point-based methods inherently require human-curated positive/negative points, which demands interactive visualization. Box-generated masks may be less precise, but automated box generation is comparatively simple. For users possessing substantial domain-specific segmentation data, training a custom YOLO model for detection/segmentation is advisable. In most zero-shot application scenarios, however, existing generalized models are all I have to work with.
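
For the box-to-mask step itself, a minimal sketch with the SAM2 image predictor could look like the following; the config/checkpoint/image paths and the box coordinates are placeholders, not values from my pipeline:

import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder SAM2 config/checkpoint paths; adjust to your local install
sam2_model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt", device="cuda")
predictor = SAM2ImagePredictor(sam2_model)

# Load a single frame and prompt with xyxy boxes from any detector
image = np.array(Image.open("frame_0000.jpg").convert("RGB"))
predictor.set_image(image)
boxes = np.array([[100, 150, 400, 480]])  # made-up box coordinates, one row per instance
masks, scores, _ = predictor.predict(
    point_coords=None,
    point_labels=None,
    box=boxes,
    multimask_output=False,
)
# One binary mask per box, ready to be handed to the SAM2 video predictor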

Object Detection Models

Now let's discuss how to combine detection networks with SAM2 for video inference. Among detection models, I recommend GroundingDINO and Florence-2-large-ft. Notably, Florence-2 is multi-functional, covering object detection, grounding, and caption+grounding. In my empirical testing, GroundingDINO outperforms Florence-2 on grounding tasks, so I primarily use Florence-2 for plain object detection and leave grounding to GroundingDINO.
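
Florence-2 switches between these tasks purely through its task-prompt token. As a rough illustration (reusing the run_example helper from the full listing further below; image is a PIL image):

# Florence-2 task-prompt tokens:
#   '<OD>'                           - plain object detection, no text input
#   '<CAPTION_TO_PHRASE_GROUNDING>'  - grounding: pass the phrase as text_input
#   caption+grounding chains a captioning task with phrase grounding
detections = run_example('<OD>', image)
grounding = run_example('<CAPTION_TO_PHRASE_GROUNDING>', image, text_input='a red car')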

The critical distinction is that GroundingDINO requires a text caption as its prompt. In my tests it performs best with single-keyword captions formatted as 'keyword1,keyword2,...': this keeps detection stable while still capturing multiple instances. However, GroundingDINO is highly sensitive to hyperparameter tuning. An excessive box threshold increases missed detections, while an insufficient one boosts false positives. Parameter tuning must account for image quality and model generalization; in layman's terms, expect substantial trial and error.
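
For reference, a minimal GroundingDINO call with the official repo's inference utilities might look like the sketch below; the config/checkpoint/image paths are placeholders, and the threshold values are just common starting points rather than tuned recommendations:

import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict

# Placeholder paths to the GroundingDINO config and weights
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("frame_0000.jpg")

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person,dog,bicycle",  # single keywords, comma-separated
    box_threshold=0.35,            # higher -> more missed detections
    text_threshold=0.25,           # lower -> more spurious phrase matches
)

# GroundingDINO returns normalized cxcywh boxes; SAM/SAM2 expect pixel xyxy
h, w, _ = image_source.shape
xyxy_boxes = box_convert(
    boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy"
).numpy()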

SAM2 Integration

Having established frame-wise bounding box generation, a basic implementation acquires mask prompts from the first frame and tracks those instances throughout the video. Reference implementations include Florence2 + SAM2 and Grounded-SAM-2. These suffice for tracking a single persistent instance (e.g., a camera-followed object). However, dynamic scenes in which new objects keep appearing demand a more sophisticated approach.
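
Before moving on, here is roughly what that first-frame-only baseline looks like on the video side, assuming the first frame's boxes have already been turned into binary masks (e.g. as in the earlier image-predictor sketch); first_frame_masks, the paths, and the config name are placeholders:

import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint paths and a folder of extracted frames
video_predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt", device="cuda")
inference_state = video_predictor.init_state(video_path="frames/")

# first_frame_masks: list of (H, W) binary masks derived from the first frame's boxes
for obj_id, mask in enumerate(first_frame_masks):
    video_predictor.add_new_mask(inference_state, frame_idx=0, obj_id=obj_id, mask=mask)

# Propagate every registered object through the rest of the video
video_segments = {}
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(inference_state):
    video_segments[frame_idx] = {
        obj_id: (mask_logits[i] > 0.0).cpu().numpy()
        for i, obj_id in enumerate(obj_ids)
    }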

For comprehensive video instance tracking, I cannot rely solely on initial-frame detections. My goal is full automation rather than assisted annotation, and since everything runs offline there are no real-time constraints. Here's how I think about it:

  1. Require per-frame detection with novel instance identification
  2. Generate initial frame masks, then detect new instances in subsequent frames
  3. Determine novelty through mask overlap analysis between detections and existing instances
  4. Implement quality control to filter SAM's fragmented outputs (users familiar with SAM will recognize these artifacts)
  5. Address detection inconsistencies - new boxes might represent previously missed instances
  6. Enable temporal backward propagation since SAM2 lacks native backward inference
  7. Implement duplicate prevention through mask overlap checks and quality filtering

Thus, the implementation logic unfolds as follows:

  1. Frame extraction from the video
  2. Per-frame object detection
  3. Iterative processing from the initial frame; for each frame:
     - Convert the current frame's bounding boxes to masks via SAM, with post-processing
     - Identify novel instances through mask overlap analysis (a possible overlap/fragmentation check is sketched right after this list)
     - Feed the new masks to SAM2 for forward video propagation
     - Reinitialize the video predictor with reversed frame order and propagate again for backward propagation
     - Deduplicate trajectories through overlap checks
     - Apply post-processing and update the master mask repository
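
The overlap and fragmentation checks are simple mask-level heuristics. Since the full listing below only stubs them out, here is one possible way to implement overlapped and crashed; the IoU threshold and the connected-component criterion are my own assumptions, not fixed choices:

import numpy as np
import cv2

def mask_iou(mask_a, mask_b):
    # Intersection-over-union between two binary masks
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def overlapped(exist_masks, new_mask, iou_thresh=0.5):
    # A detection is considered "not novel" if it overlaps any instance
    # already tracked on this frame (exist_masks: {obj_id: mask})
    return any(mask_iou(np.squeeze(m), np.squeeze(new_mask)) > iou_thresh
               for m in exist_masks.values())

def crashed(mask, min_area_ratio=0.8):
    # Flag fragmented SAM outputs: if the largest connected component covers
    # much less than the full mask area, the mask is likely broken into pieces
    mask_u8 = np.squeeze(mask).astype(np.uint8)
    if mask_u8.sum() == 0:
        return True
    num, labels = cv2.connectedComponents(mask_u8)
    largest = max((labels == i).sum() for i in range(1, num))
    return largest / mask_u8.sum() < min_area_ratio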

Here is the core implementation (using Florence 2 as an example):

import gradio as gr
import numpy as np
import torch
import cv2
import spaces  # provides the @spaces.GPU decorator used below
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
...

# Functions for Florence-2 inference
models = {
    'microsoft/Florence-2-large-ft': AutoModelForCausalLM.from_pretrained(
        '/data/zihan/florence2_sam2/Florence-2-large-ft', trust_remote_code=True
    ).to("cuda").eval(),
}
processors = {
    'microsoft/Florence-2-large-ft': AutoProcessor.from_pretrained(
        '/data/zihan/florence2_sam2/Florence-2-large-ft', trust_remote_code=True
    ),
}

@spaces.GPU
def run_example(task_prompt, image, text_input=None, model_id='microsoft/Florence-2-large-ft'):
    model = models[model_id]
    processor = processors[model_id]
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1048,
        early_stopping=False,
        do_sample=False,
        num_beams=5,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer

def process_image(image, model_id='microsoft/Florence-2-large-ft'):
    image = Image.fromarray(image)
    task_prompt = '<OD>'
    results = run_example(task_prompt, image, model_id=model_id)
    return results

def florence_inf(video):
    cap = cv2.VideoCapture(video)
    frame_bboxes = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        output_text = process_image(frame)
        bboxes = output_text['<OD>']['bboxes']
        frame_bboxes.append(bboxes)
    cap.release()
    return frame_bboxes, 'Florence-2 OD has done its job ξ( ✿>◡❛)'

def video2images(video, output_folder):
    # Read the video and save the frames to a folder
    # ... (skipped for brevity) ...

def crashed(mask):
    # Check whether the masked region is fragmented (not connected)
    # ... (skipped for brevity) ...

def overlapped(exist_masks, new_mask):
    # Check whether the new mask overlaps with existing masks
    # ... (skipped for brevity) ...

def fill_small_bubbles(mask):
    # Find and fill blank regions inside a mask
    # ... (skipped for brevity) ...

def median_filter(mask):
    # Smooth mask boundaries with a median filter
    # ... (skipped for brevity) ...

def sam2_inference(images_path, frame_bboxes, save_dir_name, vis_frame_stride=1):
    # Set checkpoints, device, paths, num_frames, frame_names, etc.
    # ... (skipped for brevity) ...

    # Initialize the SAM2 image and video predictors
    from sam2.build_sam import build_sam2_video_predictor, build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor
    video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device)
    inference_state = video_predictor.init_state(video_path=images_path)
    sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=device)
    image_predictor = SAM2ImagePredictor(sam2_model)

    video_segments = {}  # Saved masks: {frame_idx: {obj_id: mask}}
    # The loop starts
    for ann_frame_idx in range(num_frames):
        video_predictor.reset_state(inference_state)
        # Load the current frame (image) and its detected boxes (bboxes)
        # ... (skipped for brevity) ...
        # Get predicted masks from boxes through the image predictor
        image_predictor.set_image(image)
        masks_all, _, _ = image_predictor.predict(
            point_coords=None,
            point_labels=None,
            box=bboxes,
            multimask_output=False,
        )
        # Post-processing: keep only masks that are connected and that do not
        # overlap with instances already tracked on this frame
        masks = []
        exist_masks = video_segments.get(ann_frame_idx, {})
        for i in range(len(masks_all)):
            if overlapped(exist_masks, masks_all[i]) or crashed(masks_all[i]):
                continue
            masks.append(masks_all[i])

        # Add masks to the video predictor and propagate forward
        # ... (skipped for brevity) ...

        # Reverse the frame order for backward propagation
        inference_state['images'] = torch.flip(inference_state['images'], dims=[0])
        video_predictor.reset_state(inference_state)

        # Add masks to the video predictor and propagate (now backward in time)
        # ... (skipped for brevity) ...

        # Reverse the frame order back to the original
        inference_state['images'] = torch.flip(inference_state['images'], dims=[0])

    # Post-process all the generated masks and save
    for out_frame_idx in range(0, len(frame_names), vis_frame_stride):
        for out_obj_id, out_mask in video_segments[out_frame_idx].items():
            # Check whether the output mask should be kept
            # (exist_masks: masks already written for this frame, bookkeeping skipped)
            if crashed(out_mask) or overlapped(exist_masks, out_mask):
                continue
            out_mask = median_filter(out_mask)
            out_mask = fill_small_bubbles(out_mask)
            # Visualization and write
            # ... (skipped for brevity) ...

def images2video(image_folder, output_video_path, fps=24):
    # Frame-to-video conversion logic
    # ... (skipped for brevity) ...

def sam2_video(input_video, florence_bboxes):
    video2images(...)
    sam2_inference(...)
    images2video(...)
    return output_video, output_mask

if __name__ == "__main__":

    with gr.Blocks() as demo:

        gr.Markdown("# Florence-2 + SAM2")

        with gr.Row():
            with gr.Column():
                input_video = gr.Video(format='mp4', label='Source Video')
                florence_bboxes = gr.State()
                terminal = gr.Textbox(label='Pseudo Terminal')
            with gr.Column():
                output_video = gr.Video(format='mp4', label="SAM2 Vis", show_download_button=True)
                output_mask = gr.Video(format='mp4', label="Mask", show_download_button=True)

        input_video.upload(florence_inf, [input_video], [florence_bboxes, terminal])
        terminal.change(sam2_video, [input_video, florence_bboxes], [output_video, output_mask])

    demo.launch(server_port=your_port)
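
For completeness, the two mask-cleanup helpers stubbed out in the listing can be simple OpenCV routines. A rough sketch follows; the kernel size is an arbitrary placeholder, and the hole-filling trick assumes the top-left pixel of the mask is background:

import numpy as np
import cv2

def median_filter(mask, ksize=5):
    # Smooth jagged mask boundaries with a median blur
    return cv2.medianBlur(mask.astype(np.uint8), ksize).astype(bool)

def fill_small_bubbles(mask):
    # Fill holes inside the mask: flood-fill the background from a corner,
    # then everything the flood fill could not reach is a hole to be filled
    mask_u8 = mask.astype(np.uint8) * 255
    h, w = mask_u8.shape
    flood = mask_u8.copy()
    ff_mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, ff_mask, (0, 0), 255)
    holes = cv2.bitwise_not(flood)
    return (mask_u8 | holes) > 0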

Example of Florence 2 + SAM 2

Example of GroundingDINO + SAM 2:

To contact me, send an email to zihanliu@hotmail.com.