🤖 AI Summary
Character.AI has open‑sourced Ovi, a twin‑backbone, cross‑modal video+audio generative model (~11B‑parameter checkpoint) that simultaneously produces synchronized video and audio from text or text+image prompts. The release includes code, training/inference scripts, a Gradio demo, multi‑GPU and sharded (FSDP) inference support, example prompts, and a research paper describing the architecture. Demos are available on Wavespeed and Hugging Face, and the project builds on Wan2.2 for video and MMAudio components for audio.
Technically, Ovi generates 5‑second clips at 24 FPS (default 720×720, multiple aspect ratios) using diffusion sampling (default num_steps=50, solver "unipc") with cross‑modal guidance knobs (audio_guidance_scale, video_guidance_scale) and Skip Layer Guidance (slg_layer=11). Prompts can embed speech lines (<S>…<E>) and audio‑scene descriptions (<AUDCAP>…<ENDAUDCAP>) to control dialogue and sound effects. Inference supports sequence parallelism, FlashAttention-3, and CPU offload; the minimum GPU requirement is ~32 GB of VRAM, with higher‑performance configurations using up to ~80 GB. End‑to‑end generation times vary (e.g., ~40–140 s depending on GPU count and configuration). Roadmap items include higher‑resolution and longer videos, distilled faster models, and community contributions, making Ovi a practical, extensible platform for multimodal research and creative applications.
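To make the prompt format and sampling knobs above concrete, here is a minimal Python sketch. The tag syntax (<S>…<E>, <AUDCAP>…<ENDAUDCAP>) and the parameter names (num_steps, audio_guidance_scale, video_guidance_scale, slg_layer) come from the summary above; the guidance values and the `generate_video_audio` entry point are hypothetical placeholders, not Ovi's actual API, so consult the repository's inference scripts for the real invocation.

```python
# Sketch of an Ovi-style prompt and sampling configuration.
# Speech is wrapped in <S>...<E>; ambient sound / sound-effect descriptions
# are wrapped in <AUDCAP>...<ENDAUDCAP>.
prompt = (
    "A street performer plays guitar under neon lights at night. "
    "<S>Thanks for stopping by, this one's for you.<E> "
    "<AUDCAP>Acoustic guitar strumming, light rain, distant traffic hum.<ENDAUDCAP>"
)

# Sampling knobs named in the summary; the two guidance values below are
# illustrative, not documented defaults.
sampling_config = {
    "num_steps": 50,              # diffusion sampling steps (default)
    "solver_name": "unipc",       # default solver
    "audio_guidance_scale": 3.0,  # illustrative value
    "video_guidance_scale": 4.0,  # illustrative value
    "slg_layer": 11,              # Skip Layer Guidance layer (default)
    "resolution": (720, 720),     # default 720x720; other aspect ratios supported
}

# Hypothetical call for illustration only; the repository exposes its own
# inference scripts (single-GPU, multi-GPU/FSDP, and a Gradio demo).
# video, audio = generate_video_audio(prompt, **sampling_config)
```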