Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos (arxiv.org)

🤖 AI Summary
Researchers show that pretrained image diffusion models, originally built for static image synthesis, implicitly encode pixel-level semantic correspondences that can be repurposed for video understanding. By reinterpreting the models' self-attention maps as semantic label-propagation kernels and extending them across frames, the team derives a temporal propagation kernel that performs zero-shot object tracking by segmentation. Robustness is further improved with test-time optimization techniques (DDIM inversion, textual inversion, and adaptive head weighting) that better align the diffusion features with the target object. Combining these insights, they introduce DRIFT, a pipeline that uses a frozen image diffusion model for propagation and refines the resulting masks with SAM, yielding state-of-the-art zero-shot performance on standard video object segmentation benchmarks.

This work is significant because it reveals an emergent temporal propagation capability in image diffusion architectures, enabling video tasks without video-specific training or labeled temporal data. Technically, it highlights that self-attention in diffusion models serves as a powerful correspondence kernel, and that modest test-time adaptation plus mask refinement can close the gap to supervised trackers. The result opens a new direction for leveraging large generative models for downstream spatiotemporal tasks, reducing reliance on annotated video datasets and suggesting hybrid systems that combine generative attention with specialized refiners like SAM.
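To make the propagation idea concrete, here is a minimal sketch (not taken from the paper) of the general technique the summary describes: cross-frame feature affinity used as a label-propagation kernel, in the role the authors assign to the diffusion model's self-attention. Reference-frame features and a reference mask are matched against query-frame features, and a softmax-normalized kernel carries the labels forward. The tensor shapes, the `temperature` and `topk` parameters, and the top-k sparsification are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def propagate_labels(ref_feats, ref_mask, query_feats, temperature=0.07, topk=16):
    """Hypothetical sketch of attention-style label propagation across frames.

    ref_feats:   (C, H, W) features of the reference (previous) frame
    ref_mask:    (K, H, W) soft label map for K objects in the reference frame
    query_feats: (C, H, W) features of the query (current) frame
    returns:     (K, H, W) propagated soft label map for the query frame
    """
    C, H, W = ref_feats.shape
    K = ref_mask.shape[0]

    # Flatten spatial dims and L2-normalize so dot products act as cosine similarities.
    ref = F.normalize(ref_feats.reshape(C, -1), dim=0)    # (C, HW)
    qry = F.normalize(query_feats.reshape(C, -1), dim=0)  # (C, HW)

    # Affinity between every query location and every reference location.
    affinity = qry.t() @ ref / temperature                # (HW_query, HW_ref)

    # Keep only the top-k reference matches per query location (a common trick to
    # suppress spurious correspondences), then softmax-normalize into a kernel.
    vals, idx = affinity.topk(topk, dim=1)
    kernel = torch.zeros_like(affinity).scatter_(1, idx, F.softmax(vals, dim=1))

    # Propagate the reference labels through the kernel.
    labels = kernel @ ref_mask.reshape(K, -1).t()         # (HW_query, K)
    return labels.t().reshape(K, H, W)
```

In the pipeline the summary describes, the features and kernel would come from the frozen diffusion model's self-attention rather than a generic encoder, and the coarse propagated masks would then be handed to SAM for refinement.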