Autonomous Video Tracking & Segmentation

SAM2 (Segment Anything 2) introduces a user-friendly human-computer interaction paradigm: from a simple or complex prompt, it can generate a full mask trajectory for an object across a video. This dramatically reduces video annotation labor from per-frame instance labeling to single-annotation tracking, significantly compressing the time cost of video data preparation. To push this toward fully automated annotation, I think building an automatic prompt generation system for SAM2 is worth a shot.

SAM2 supports mask generation from points/bounding boxes on specified frames, as well as video inference from point/box/mask prompts. Among these three prompt types, I typically first obtain initial masks via SAM using point/box prompts, then feed these masks into SAM2 for temporal propagation. Comparing point and box prompting, bounding boxes prove more automation-friendly: point-based methods inherently require human-curated positive/negative points, which demands interactive visualization. Box-generated masks may be less precise, but automated box generation is comparatively simple. For users possessing substantial domain-specific segmentation data, training a custom YOLO model for detection/segmentation is advisable. In most zero-shot application scenarios, however, existing generalized models are all I have to work with.
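
For the box-to-mask step itself, a minimal sketch with the SAM2 image predictor could look like the following; the config/checkpoint/image paths and the box coordinates are placeholders, not values from my pipeline:

import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder SAM2 config/checkpoint paths; adjust to your local install
sam2_model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt", device="cuda")
predictor = SAM2ImagePredictor(sam2_model)

# Load a single frame and prompt with xyxy boxes from any detector
image = np.array(Image.open("frame_0000.jpg").convert("RGB"))
predictor.set_image(image)
boxes = np.array([[100, 150, 400, 480]])  # made-up box coordinates, one row per instance
masks, scores, _ = predictor.predict(
    point_coords=None,
    point_labels=None,
    box=boxes,
    multimask_output=False,
)
# One binary mask per box, ready to be handed to the SAM2 video predictor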

Object Detection Models

Now let's discuss how to combine detection networks with SAM2 for video inference. Among detection models, I recommend GroundingDINO and Florence-2-large-ft. Notably, Florence-2 is multi-functional, covering object detection, grounding, and caption+grounding. In my empirical testing, GroundingDINO outperforms Florence-2 on grounding tasks, so I primarily use Florence-2 for plain object detection and leave grounding to GroundingDINO.
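
Florence-2 switches between these tasks purely through its task-prompt token. As a rough illustration (reusing the run_example helper from the full listing further below; image is a PIL image):

# Florence-2 task-prompt tokens:
#   '<OD>'                           - plain object detection, no text input
#   '<CAPTION_TO_PHRASE_GROUNDING>'  - grounding: pass the phrase as text_input
#   caption+grounding chains a captioning task with phrase grounding
detections = run_example('<OD>', image)
grounding = run_example('<CAPTION_TO_PHRASE_GROUNDING>', image, text_input='a red car')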

The critical distinction is that GroundingDINO requires a text caption as its prompt. In my tests it performs best with single-keyword captions formatted as 'keyword1,keyword2,...': this keeps detection stable while still capturing multiple instances. However, GroundingDINO is highly sensitive to hyperparameter tuning. An excessive box threshold increases missed detections, while an insufficient one boosts false positives. Parameter tuning must account for image quality and model generalization; in layman's terms, expect substantial trial and error.
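
For reference, a minimal GroundingDINO call with the official repo's inference utilities might look like the sketch below; the config/checkpoint/image paths are placeholders, and the threshold values are just common starting points rather than tuned recommendations:

import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict

# Placeholder paths to the GroundingDINO config and weights
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("frame_0000.jpg")

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person,dog,bicycle",  # single keywords, comma-separated
    box_threshold=0.35,            # higher -> more missed detections
    text_threshold=0.25,           # lower -> more spurious phrase matches
)

# GroundingDINO returns normalized cxcywh boxes; SAM/SAM2 expect pixel xyxy
h, w, _ = image_source.shape
xyxy_boxes = box_convert(
    boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy"
).numpy()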

SAM2 Integration

Having established frame-wise bounding box generation, a basic implementation acquires mask prompts from the first frame and tracks those instances throughout the video. Reference implementations include Florence2 + SAM2 and Grounded-SAM-2. These suffice for tracking a single persistent instance (e.g., a camera-followed object). However, dynamic scenes in which new objects keep appearing demand a more sophisticated approach.
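
Before moving on, here is roughly what that first-frame-only baseline looks like on the video side, assuming the first frame's boxes have already been turned into binary masks (e.g. as in the earlier image-predictor sketch); first_frame_masks, the paths, and the config name are placeholders:

import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint paths and a folder of extracted frames
video_predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt", device="cuda")
inference_state = video_predictor.init_state(video_path="frames/")

# first_frame_masks: list of (H, W) binary masks derived from the first frame's boxes
for obj_id, mask in enumerate(first_frame_masks):
    video_predictor.add_new_mask(inference_state, frame_idx=0, obj_id=obj_id, mask=mask)

# Propagate every registered object through the rest of the video
video_segments = {}
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(inference_state):
    video_segments[frame_idx] = {
        obj_id: (mask_logits[i] > 0.0).cpu().numpy()
        for i, obj_id in enumerate(obj_ids)
    }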

For comprehensive video instance tracking, I cannot rely solely on initial-frame detections. My goal is full automation rather than assisted annotation, and since everything runs offline there are no real-time constraints. Here's how I think about it:

  1. Require per-frame detection with novel instance identification
  2. Generate initial frame masks, then detect new instances in subsequent frames
  3. Determine novelty through mask overlap analysis between detections and existing instances
  4. Implement quality control to filter SAM's fragmented outputs (users familiar with SAM will recognize these artifacts)
  5. Address detection inconsistencies - new boxes might represent previously missed instances
  6. Enable temporal backward propagation since SAM2 lacks native backward inference
  7. Implement duplicate prevention through mask overlap checks and quality filtering

Thus, the implementation logic unfolds as follows:

  1. Frame extraction from the video
  2. Per-frame object detection
  3. Iterative processing from the initial frame; for each frame:
     - Convert the current frame's bounding boxes to masks via SAM, with post-processing
     - Identify novel instances through mask overlap analysis (a possible overlap/fragmentation check is sketched right after this list)
     - Feed the new masks to SAM2 for forward video propagation
     - Reinitialize the video predictor with reversed frame order and propagate again for backward propagation
     - Deduplicate trajectories through overlap checks
     - Apply post-processing and update the master mask repository
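
The overlap and fragmentation checks are simple mask-level heuristics. Since the full listing below only stubs them out, here is one possible way to implement overlapped and crashed; the IoU threshold and the connected-component criterion are my own assumptions, not fixed choices:

import numpy as np
import cv2

def mask_iou(mask_a, mask_b):
    # Intersection-over-union between two binary masks
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def overlapped(exist_masks, new_mask, iou_thresh=0.5):
    # A detection is considered "not novel" if it overlaps any instance
    # already tracked on this frame (exist_masks: {obj_id: mask})
    return any(mask_iou(np.squeeze(m), np.squeeze(new_mask)) > iou_thresh
               for m in exist_masks.values())

def crashed(mask, min_area_ratio=0.8):
    # Flag fragmented SAM outputs: if the largest connected component covers
    # much less than the full mask area, the mask is likely broken into pieces
    mask_u8 = np.squeeze(mask).astype(np.uint8)
    if mask_u8.sum() == 0:
        return True
    num, labels = cv2.connectedComponents(mask_u8)
    largest = max((labels == i).sum() for i in range(1, num))
    return largest / mask_u8.sum() < min_area_ratio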

Here is the core implementation (using Florence 2 as an example):

import gradio as gr
import numpy as np
import torch
import cv2
import spaces  # provides the @spaces.GPU decorator used below
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
...

# Functions for Florence-2 inference
models = {
    'microsoft/Florence-2-large-ft': AutoModelForCausalLM.from_pretrained(
        '/data/zihan/florence2_sam2/Florence-2-large-ft', trust_remote_code=True
    ).to("cuda").eval(),
}
processors = {
    'microsoft/Florence-2-large-ft': AutoProcessor.from_pretrained(
        '/data/zihan/florence2_sam2/Florence-2-large-ft', trust_remote_code=True
    ),
}

@spaces.GPU
def run_example(task_prompt, image, text_input=None, model_id='microsoft/Florence-2-large-ft'):
    model = models[model_id]
    processor = processors[model_id]
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1048,
        early_stopping=False,
        do_sample=False,
        num_beams=5,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer

def process_image(image, model_id='microsoft/Florence-2-large-ft'):
    image = Image.fromarray(image)
    task_prompt = '<OD>'
    results = run_example(task_prompt, image, model_id=model_id)
    return results

def florence_inf(video):
    cap = cv2.VideoCapture(video)
    frame_bboxes = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        output_text = process_image(frame)
        bboxes = output_text['<OD>']['bboxes']
        frame_bboxes.append(bboxes)
    cap.release()
    return frame_bboxes, 'Florence-2 OD has done its job ξ( ✿>◡❛)'

def video2images(video, output_folder):
    # Read the video and save the frames to a folder
    # ... (skipped for brevity) ...

def crashed(mask):
    # Check whether the masked region is fragmented (not connected)
    # ... (skipped for brevity) ...

def overlapped(exist_masks, new_mask):
    # Check whether the new mask overlaps with existing masks
    # ... (skipped for brevity) ...

def fill_small_bubbles(mask):
    # Find and fill blank regions inside a mask
    # ... (skipped for brevity) ...

def median_filter(mask):
    # Smooth mask boundaries with a median filter
    # ... (skipped for brevity) ...

def sam2_inference(images_path, frame_bboxes, save_dir_name, vis_frame_stride=1):
    # Set checkpoints, device, paths, num_frames, frame_names, etc.
    # ... (skipped for brevity) ...

    # Initialize the SAM2 image and video predictors
    from sam2.build_sam import build_sam2_video_predictor, build_sam2
    from sam2.sam2_image_predictor import SAM2ImagePredictor
    video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device)
    inference_state = video_predictor.init_state(video_path=images_path)
    sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=device)
    image_predictor = SAM2ImagePredictor(sam2_model)

    video_segments = {}  # Saved masks: {frame_idx: {obj_id: mask}}
    # The loop starts
    for ann_frame_idx in range(num_frames):
        video_predictor.reset_state(inference_state)
        # Load the current frame (image) and its detected boxes (bboxes)
        # ... (skipped for brevity) ...
        # Get predicted masks from boxes through the image predictor
        image_predictor.set_image(image)
        masks_all, _, _ = image_predictor.predict(
            point_coords=None,
            point_labels=None,
            box=bboxes,
            multimask_output=False,
        )
        # Post-processing: keep only masks that are connected and that do not
        # overlap with instances already tracked on this frame
        masks = []
        exist_masks = video_segments.get(ann_frame_idx, {})
        for i in range(len(masks_all)):
            if overlapped(exist_masks, masks_all[i]) or crashed(masks_all[i]):
                continue
            masks.append(masks_all[i])

        # Add masks to the video predictor and propagate forward
        # ... (skipped for brevity) ...

        # Reverse the frame order for backward propagation
        inference_state['images'] = torch.flip(inference_state['images'], dims=[0])
        video_predictor.reset_state(inference_state)

        # Add masks to the video predictor and propagate (now backward in time)
        # ... (skipped for brevity) ...

        # Reverse the frame order back to the original
        inference_state['images'] = torch.flip(inference_state['images'], dims=[0])

    # Post-process all the generated masks and save
    for out_frame_idx in range(0, len(frame_names), vis_frame_stride):
        for out_obj_id, out_mask in video_segments[out_frame_idx].items():
            # Check whether the output mask should be kept
            # (exist_masks: masks already written for this frame, bookkeeping skipped)
            if crashed(out_mask) or overlapped(exist_masks, out_mask):
                continue
            out_mask = median_filter(out_mask)
            out_mask = fill_small_bubbles(out_mask)
            # Visualization and write
            # ... (skipped for brevity) ...

def images2video(image_folder, output_video_path, fps=24):
    # Frame-to-video conversion logic
    # ... (skipped for brevity) ...

def sam2_video(input_video, florence_bboxes):
    video2images(...)
    sam2_inference(...)
    images2video(...)
    return output_video, output_mask

if __name__ == "__main__":

    with gr.Blocks() as demo:

        gr.Markdown("# Florence-2 + SAM2")

        with gr.Row():
            with gr.Column():
                input_video = gr.Video(format='mp4', label='Source Video')
                florence_bboxes = gr.State()
                terminal = gr.Textbox(label='Pseudo Terminal')
            with gr.Column():
                output_video = gr.Video(format='mp4', label="SAM2 Vis", show_download_button=True)
                output_mask = gr.Video(format='mp4', label="Mask", show_download_button=True)

        input_video.upload(florence_inf, [input_video], [florence_bboxes, terminal])
        terminal.change(sam2_video, [input_video, florence_bboxes], [output_video, output_mask])

    demo.launch(server_port=your_port)
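
For completeness, the two mask-cleanup helpers stubbed out in the listing can be simple OpenCV routines. A rough sketch follows; the kernel size is an arbitrary placeholder, and the hole-filling trick assumes the top-left pixel of the mask is background:

import numpy as np
import cv2

def median_filter(mask, ksize=5):
    # Smooth jagged mask boundaries with a median blur
    return cv2.medianBlur(mask.astype(np.uint8), ksize).astype(bool)

def fill_small_bubbles(mask):
    # Fill holes inside the mask: flood-fill the background from a corner,
    # then everything the flood fill could not reach is a hole to be filled
    mask_u8 = mask.astype(np.uint8) * 255
    h, w = mask_u8.shape
    flood = mask_u8.copy()
    ff_mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, ff_mask, (0, 0), 255)
    holes = cv2.bitwise_not(flood)
    return (mask_u8 | holes) > 0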

Example of Florence 2 + SAM 2

Example of GroundingDINO + SAM 2:

To contact me, send an email to zihanliu@hotmail.com.