🤖 AI Summary
Researchers from the University of Science and Technology of China and ByteDance have released BindWeave, a unified framework and codebase for subject-consistent video generation with single or multiple subjects. BindWeave couples a pretrained multimodal LLM with a diffusion transformer (MLLM‑DiT) for cross-modal integration: the MLLM parses complex prompts and reference images into subject‑aware hidden states (via entity grounding and representation alignment), which then condition the DiT for high‑fidelity frame synthesis. The repo includes training and inference code, a BindWeave_Wan_14B checkpoint, scripts for prompt refinement (prompt_refine.sh) and hidden‑state extraction (hiddenstates_extraction.sh, on a dedicated branch), plus guidance on assembling the WanX 2.1 14B pretrained components (VAE, text encoder, image encoder) and running a conversion script (convert_ckpt.py) before inference (inference_s2v.sh); a sketch of that pipeline follows.
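The following is a minimal Python sketch of the release workflow described above, not the authors' own tooling. The script names come from the repo, but their arguments, paths, and ordering details here are assumptions; consult the repository's README for the actual flags and configuration files.

```python
# Hypothetical driver for the BindWeave release workflow. Script names are from
# the repo; any arguments/paths are placeholders (see the repo docs for real ones).
import subprocess


def run(cmd: list[str]) -> None:
    """Run one pipeline stage and fail fast if it errors."""
    print(f"[bindweave] running: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)


# 1. Refine raw user prompts into the structured form the pipeline expects.
run(["bash", "prompt_refine.sh"])

# 2. Extract subject-aware MLLM hidden states (provided on a dedicated branch).
run(["bash", "hiddenstates_extraction.sh"])

# 3. Convert the WanX 2.1 14B pretrained components (VAE, text/image encoders)
#    into the checkpoint layout BindWeave expects.
run(["python", "convert_ckpt.py"])

# 4. Run subject-to-video inference with the BindWeave_Wan_14B checkpoint.
run(["bash", "inference_s2v.sh"])
```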
Significance for AI/ML: BindWeave demonstrates a practical, modular approach to grounding subjects in generated video by using MLLM hidden states as an explicit conditioning signal, enabling finer subject consistency and multi-subject control (see the conditioning sketch below). On OpenS2V‑Eval it scores 57.61, competitive with other 14B systems (e.g., VACE‑14B at 57.55), with especially strong motion smoothness (95.9%) but low motion amplitude (13.9%), indicative of stable yet conservative motion. The release lowers the barrier to reproducing and extending the method, but it depends on large pretrained components and nontrivial conversion and feature-extraction steps, so it is best suited to teams with substantial compute.
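To make the conditioning idea concrete, here is a minimal PyTorch sketch, not the authors' code: subject-aware hidden states from an MLLM are injected into a DiT block through cross-attention. All module names, dimensions, and the specific block layout are illustrative assumptions.

```python
# Sketch of conditioning a DiT block on MLLM hidden states via cross-attention.
# Shapes and module structure are assumptions, not the BindWeave implementation.
import torch
import torch.nn as nn


class ConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 1024, mllm_dim: int = 4096, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Project MLLM hidden states into the DiT's channel width.
        self.cond_proj = nn.Linear(mllm_dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens: torch.Tensor, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # Self-attention over spatio-temporal video latent tokens.
        x = video_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention: queries are video tokens; keys/values are the
        # subject-aware MLLM hidden states (entity-grounded prompt + references).
        cond = self.cond_proj(mllm_hidden)
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond)[0]
        return x + self.mlp(self.norm3(x))


# Toy shapes: 2 clips, 256 latent tokens each, conditioned on 77 MLLM tokens.
tokens = torch.randn(2, 256, 1024)
cond_states = torch.randn(2, 77, 4096)
out = ConditionedDiTBlock()(tokens, cond_states)
print(out.shape)  # torch.Size([2, 256, 1024])
```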