🤖 AI Summary
Researchers from HKUST and Ant Group released HoloCine, a text-to-video system that generates full cinematic scenes as multi-shot, long-form narratives rather than isolated clips. Given a global scene description plus shot-by-shot captions (and optional cut-frame timings), HoloCine produces temporally coherent videos with consistent characters, objects, and visual style across shots, giving users directorial control over framing, cuts, and pacing. The authors provide inference code, pre-trained checkpoints (HoloCine-14B and 5B variants, plus audio), and demo assets so researchers can reproduce multi-shot narratives a minute or longer.
Technically, HoloCine models multi-shot sequences with either dense full attention (higher quality) or a faster sparse inter-shot attention mechanism implemented with FlashAttention (v2/v3 supported, v3 recommended). The pipeline builds on Wan2.2 components (T5 encoder and VAE) and fine-tuned DiT diffusion checkpoints (high/low-noise pairs). The default sequence length is 241 frames, and prompts follow a structured format (global_caption, shot_captions, shot_cut_frames) for precise control. The trade-offs are explicit: full attention yields more stable fidelity, while sparse attention speeds up inference but can be slightly unstable. Released under CC BY-NC-SA for academic use, HoloCine advances long-range temporal consistency and controllability in text-to-video generation, opening practical pathways for automated storyboarding, multi-shot filmmaking experiments, and more sophisticated narrative generation.
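To make the prompt structure concrete, here is a minimal sketch of what a shot-segmented prompt could look like. Only the field names (global_caption, shot_captions, shot_cut_frames) and the 241-frame default come from the summary above; the caption text, cut positions, and the generate_video entry point are hypothetical placeholders, not the project's actual API.

```python
# Hedged sketch of HoloCine's structured prompt format. The three field names
# and the 241-frame default are from the description above; everything else
# (caption text, cut positions, the generate_video call) is an illustrative
# assumption, not the released interface.
prompt = {
    # Scene-level description shared by every shot: characters, setting, style.
    "global_caption": (
        "A detective questions a suspect in a dim interrogation room; "
        "film-noir lighting, consistent faces and wardrobe across shots."
    ),
    # One caption per shot, in playback order.
    "shot_captions": [
        "Wide shot: both characters seated across a metal table.",
        "Close-up: the detective leans forward, half his face in shadow.",
        "Over-the-shoulder: the suspect's hands fidget with a lighter.",
    ],
    # Optional cut positions (frame indices) inside the 241-frame window.
    "shot_cut_frames": [0, 97, 177],
}

# Hypothetical call; real inference is driven by the released scripts and
# checkpoints (e.g. the HoloCine-14B high/low-noise DiT pair).
# video = generate_video(prompt, num_frames=241, attention="sparse")  # or "full"
```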
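The summary does not specify the exact sparsity pattern of the inter-shot attention, but the general idea of replacing dense cross-shot attention with a cheaper approximation can be illustrated with a simple mask: tokens attend densely within their own shot and only to a small set of anchor tokens in other shots. The anchor scheme below is an assumed, illustrative pattern, not HoloCine's documented design.

```python
import torch

def sparse_inter_shot_mask(shot_ids: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    shot_ids: (L,) shot index of each token.
    anchor:   (L,) bool, marking a few per-shot "anchor" tokens that remain
              visible to every other shot. This anchor choice is an assumption
              made for illustration; the paper's actual pattern may differ.
    """
    same_shot = shot_ids[:, None] == shot_ids[None, :]                # dense within a shot
    cross_shot_links = anchor[None, :].expand(shot_ids.numel(), -1)   # sparse links across shots
    return same_shot | cross_shot_links

# Example: 12 tokens split into 3 shots, first token of each shot as anchor.
shot_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
anchor = torch.zeros(12, dtype=torch.bool)
anchor[[0, 4, 8]] = True
mask = sparse_inter_shot_mask(shot_ids, anchor)  # (12, 12) boolean mask
```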
        