🤖 AI Summary
Meituan’s LongCat-Video is a new 13.6B-parameter foundational video generation model that unifies Text-to-Video, Image-to-Video and Video-Continuation in a single dense architecture. Uniquely pretrained on video-continuation tasks, it is designed to produce minutes-long clips without the color drift or quality collapse common in long outputs, positioning it as an early step toward continuous “world models” for video. Human evaluations using Mean Opinion Score (MOS) show LongCat-Video delivering visual, motion and prompt-alignment quality comparable to leading open-source and commercial systems, despite using fewer parameters than some Mixture-of-Experts (MoE) competitors.
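The continuation-first design can be pictured as a chunked autoregressive loop: each new segment is generated conditioned on the tail of what has been produced so far. The sketch below is illustrative only; `generate_chunk`, the chunk and context sizes, and the frame shapes are assumptions for exposition, not LongCat-Video’s actual API.

```python
import numpy as np

# Illustrative stand-in for the generator. LongCat-Video's real API is not
# shown in the article; a diffusion model would sit here instead of this stub.
def generate_chunk(context: np.ndarray, num_new: int) -> np.ndarray:
    """Return num_new frames conditioned on the trailing context frames."""
    return np.repeat(context[-1:], num_new, axis=0)  # stub: hold the last frame

def continue_video(seed: np.ndarray, total_frames: int,
                   chunk_size: int = 32, context_size: int = 16) -> np.ndarray:
    """Chunked autoregressive continuation: grow the clip segment by
    segment, always conditioning on recent frames, which is how a
    continuation-pretrained model can reach minutes-long outputs."""
    video = seed
    while video.shape[0] < total_frames:
        new = generate_chunk(video[-context_size:],
                             min(chunk_size, total_frames - video.shape[0]))
        video = np.concatenate([video, new], axis=0)
    return video

# Extend a 16-frame seed to 900 frames (30 s at 30 fps); tiny frames keep
# this demo cheap -- the article's model targets 720p.
seed = np.zeros((16, 90, 160, 3), dtype=np.float32)
print(continue_video(seed, total_frames=900).shape)  # (900, 90, 160, 3)
```

Because every chunk sees only a bounded context window, memory stays flat as the clip grows; avoiding drift across chunks is exactly the failure mode the continuation pretraining is meant to address.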
Key technical highlights: the model uses a coarse-to-fine generation strategy along both the temporal and spatial axes to speed up inference (720p, 30 fps videos generated in minutes), plus Block Sparse Attention and FlashAttention support (v2 by default, with v3 or xformers optional) to scale efficiently at high resolution. Training was refined with multi-reward Group Relative Policy Optimization (GRPO), an RLHF-style method, to better align outputs with human preferences. Weights and code are available under an MIT license on Hugging Face and GitHub, with single- and multi-GPU demos provided. The team notes the standard caveats: the model has not been exhaustively evaluated for every downstream use, so developers should validate safety, fairness and legal compliance before deployment.
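The multi-reward GRPO step can be sketched as group-relative advantage estimation: several candidate videos are sampled per prompt, scored by multiple reward models, and each sample’s combined reward is normalized against its own group. The reward axes below mirror the MOS criteria named above, but the scores and weights are invented for illustration, not numbers from the paper.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize each sample's reward against
    the mean/std of its own group (one row = one prompt's candidates),
    so no learned value baseline is needed."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8  # avoid divide-by-zero
    return (rewards - mean) / std

# Hypothetical scores for 4 candidate videos on one prompt, one array per
# reward signal; the axes follow the article's MOS criteria, values invented.
visual = np.array([[0.8, 0.6, 0.9, 0.5]])
motion = np.array([[0.7, 0.9, 0.6, 0.8]])
align  = np.array([[0.9, 0.5, 0.8, 0.7]])

# "Multi-reward": combine the signals before normalizing. The 0.4/0.3/0.3
# weighting is illustrative only.
combined = 0.4 * visual + 0.3 * motion + 0.3 * align
print(grpo_advantages(combined))
```

In the full method these advantages would weight the policy-gradient update on the video generator; the sketch stops at the advantage computation.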
        