Show HN: HuMo AI – Create Realistic Videos with Text, Image, and Audio Inputs (www.humoai.co)

🤖 AI Summary
HuMo AI is a new tri-modal video synthesis system that generates short, lifelike human videos from combinations of text, image, and audio inputs. It supports three generation modes: Text+Image (TI) preserves a reference subject while following a script, Text+Audio (TA) produces tight lip and facial sync to speech, and Text+Image+Audio (TIA) balances subject consistency, semantic alignment, and audio-visual synchronization. The demo highlights editable text control (changing outfits, scenes, or actions while keeping identity), strong subject preservation across frames, and convincing audio-visual sync on dialogue and singing examples. Technically, HuMo produces ~4s clips by default (97 frames at 25fps), with 480p and 720p outputs and multi-GPU inference support for larger runs. The authors provide a research paper and reference code for reproduction and experimentation; practical tips include using clean audio and tuning the audio guidance scale to improve sync. For the AI/ML community, this advances multimodal conditioning for human-centric video, enabling faster character shots, virtual hosts, and e-commerce try-ons, while raising expectations for controllable, identity-preserving generative video. The reference implementation makes it a useful baseline for research and product prototypes in controllable video generation.
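To make the mode/parameter relationships above concrete, here is a minimal sketch of how the three conditioning modes and the parameters the summary calls out (97 frames, 25fps, 480p/720p, audio guidance scale) might map onto a request object. Everything here is an illustrative assumption: the class, field names, and defaults are hypothetical and not HuMo's actual API; consult the project's reference code for real usage.

```python
# Hypothetical sketch of a HuMo-style generation request. All names and
# defaults below are illustrative assumptions, not the project's real API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class HumoRequest:
    prompt: str                      # text condition, used in all three modes
    ref_image: Optional[str] = None  # path to reference image (TI / TIA)
    audio: Optional[str] = None      # path to clean speech/song audio (TA / TIA)
    num_frames: int = 97             # ~4s at 25fps, per the summary's default
    fps: int = 25
    resolution: str = "480p"         # "480p" or "720p" per the summary
    audio_guidance: float = 5.0      # tune upward if lip sync looks loose

    @property
    def mode(self) -> str:
        """Infer the generation mode from which conditions are present."""
        if self.ref_image and self.audio:
            return "TIA"  # balance identity, semantics, and A/V sync
        if self.audio:
            return "TA"   # lip and facial sync to speech
        if self.ref_image:
            return "TI"   # preserve the subject while following the script
        raise ValueError("need at least an image or audio condition beyond text")


# Example: a TIA request, the mode that juggles all three conditions.
req = HumoRequest(
    prompt="The subject sings on a neon-lit stage, wearing a red jacket",
    ref_image="subject.png",
    audio="vocals.wav",
    resolution="720p",
)
print(req.mode)  # -> "TIA"
```

Raising `audio_guidance` in a setup like this would correspond to the summary's tip about tuning the audio guidance scale when synchronization is weak, at the usual cost of less freedom in the rest of the motion.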