Maya1: Open-source 3B Voice Model (huggingface.co)

🤖 AI Summary
Maya1 is a newly released open-source speech model from Maya Research designed for expressive, emotion-rich voice generation. It is a 3B-parameter, Llama-style decoder trained to predict SNAC neural codec tokens (7 tokens per frame) instead of raw waveforms, producing 24 kHz audio at roughly 0.98 kbps. Users specify voices with plain-language descriptors inside an XML-style tag (e.g., <description="40-yr old, low-pitch, warm">) and can insert 20+ inline emotion cues such as <laugh>, <cry>, and <whisper>.

The model runs on a single GPU (16 GB+ recommended: A100, H100, or RTX 4090), supports real-time streaming with sub-100 ms latency via vLLM, automatic prefix caching, and WebAudio integration, and is released under Apache 2.0 on Hugging Face with example code for transformers + SNAC decoding.

Technically notable: Maya1 is a production-ready pipeline that packs SNAC codes compactly (a multi-scale hierarchical structure), making autoregressive generation feasible in real time. The team combined internet-scale pretraining with supervised fine-tuning on curated studio recordings annotated with human-verified voice descriptions and emotion tags. That makes it a practical open alternative to closed TTS services (no per-second fees, full customization and fine-tuning), useful for assistants, game characters, audiobooks, and accessibility tools. Because it enables realistic persona and emotion synthesis, it is a powerful research and product tool, but it also raises the standard misuse and ethical considerations around voice cloning that deployments should address.
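To make the "7 tokens per frame" packing concrete: SNAC is a multi-scale hierarchical codec, and at 24 kHz its three codebook levels contribute 1, 2, and 4 codes per coarse frame (1 + 2 + 4 = 7). A minimal sketch of how such codes could be interleaved into one flat autoregressive stream and recovered afterward is below; the function names and layout are illustrative assumptions, not Maya1's actual implementation.

```python
def flatten_snac_frames(coarse, mid, fine):
    """Interleave 3-level SNAC codes into one flat stream, 7 tokens/frame.

    Assumes the level lengths follow SNAC's 24 kHz hierarchy:
    len(mid) == 2 * len(coarse), len(fine) == 4 * len(coarse).
    """
    assert len(mid) == 2 * len(coarse) and len(fine) == 4 * len(coarse)
    stream = []
    for i, c in enumerate(coarse):
        stream.append(c)                      # 1 coarse code
        stream.extend(mid[2 * i : 2 * i + 2])  # 2 mid codes
        stream.extend(fine[4 * i : 4 * i + 4]) # 4 fine codes
    return stream


def unflatten_snac_frames(stream):
    """Invert flatten_snac_frames: split the stream back into 3 levels."""
    assert len(stream) % 7 == 0
    coarse, mid, fine = [], [], []
    for f in range(0, len(stream), 7):
        coarse.append(stream[f])
        mid.extend(stream[f + 1 : f + 3])
        fine.extend(stream[f + 3 : f + 7])
    return coarse, mid, fine
```

Packing all levels into one token-per-step stream is what lets a standard decoder-only LM predict audio autoregressively; the decoder side then unflattens the stream and hands the per-level codes to SNAC for waveform reconstruction.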