Show HN: 150M Mandarin transcription model with real-time metadata detection (huggingface.co)

🤖 AI Summary
WhissleAI has released a dual-head Mandarin Chinese automatic speech recognition (ASR) model with 150 million parameters. In a single forward pass it transcribes speech to text and classifies speaker attributes such as age, gender, and dialect. The model is built on the NVIDIA Citrinet-1024 architecture with language-specific bottleneck adapters and was fine-tuned on 60 hours of meta-annotated Mandarin speech, reaching a word error rate (WER) of 19.22% and a tag accuracy of 94.2%. A connectionist temporal classification (CTC) head keeps transcription tokens aligned with inline entity tags, so named entities are marked directly within the transcript. The output is therefore structured metadata alongside the text, which suits voice recognition, intelligent assistants, and linguistics research. The model does not yet generalize well to more spontaneous and varied Mandarin speech.
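To make the inline-tag idea concrete, here is a minimal sketch of greedy CTC decoding over a token stream that mixes characters with tag tokens, followed by a pass that separates text, entity spans, and speaker attributes. The tag names (`ENTITY_CITY`, `END`, `GENDER_FEMALE`), the `<blank>` symbol, and both function names are hypothetical and chosen for illustration; the summary does not describe the model's actual vocabulary or output format.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol


def ctc_greedy_collapse(frame_tokens):
    """Greedy CTC decoding step: merge consecutive repeats, then drop blanks.

    A genuinely doubled character survives only if a blank separates
    its two occurrences in the per-frame sequence.
    """
    merged = [tok for tok, _ in groupby(frame_tokens)]
    return [tok for tok in merged if tok != BLANK]


def parse_tagged(tokens):
    """Split decoded tokens into plain text, entity spans, and attribute tags.

    Assumed (hypothetical) convention: ENTITY_<LABEL> opens a span, END
    closes it, and AGE_/GENDER_/DIALECT_ tokens are utterance-level tags.
    """
    text, entities, attributes = [], [], []
    open_label, span = None, []
    for tok in tokens:
        if tok.startswith("ENTITY_"):
            open_label, span = tok[len("ENTITY_"):], []
        elif tok == "END":
            if open_label is not None:
                entities.append((open_label, "".join(span)))
            open_label = None
        elif tok.startswith(("AGE_", "GENDER_", "DIALECT_")):
            attributes.append(tok)
        else:
            text.append(tok)
            if open_label is not None:
                span.append(tok)
    return "".join(text), entities, attributes


# Per-frame argmax output, invented for this example.
frames = [
    BLANK, "我", "我", BLANK, "在",
    "ENTITY_CITY", "ENTITY_CITY", "北", BLANK, "京", "京", "END",
    BLANK, "GENDER_FEMALE",
]
decoded = ctc_greedy_collapse(frames)
text, entities, attributes = parse_tagged(decoded)
```

Because tags are emitted by the same CTC head as the characters, they land at the position in the sequence where the entity occurs, which is what lets a simple left-to-right pass recover the spans.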