🤖 AI Summary
TheWhisper is an open-source, fine-tuned Whisper-based speech-to-text suite optimized for low-latency, low-power, scalable streaming inference across cloud, self-hosted servers, and on-device (Apple Silicon) deployments. The release includes Hugging Face weights that handle flexible chunk sizes (10/15/20/30 s without padding), streaming pipelines, a simple Python API, an Electron/React demo app (TheNotes for macOS), and ready-to-run containers (including Jetson). It targets real-time captioning, meetings, voice interfaces, and edge use cases where privacy and power consumption matter.
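The padding-free chunking described above can be illustrated with a small client-side sketch. This is a minimal illustration, not TheWhisper's actual API: the function name, the 16 kHz sample rate, and the trim-instead-of-pad behavior are assumptions based on the summary's claims.

```python
SUPPORTED_CHUNKS_S = (10, 15, 20, 30)   # chunk lengths the release reportedly accepts
SAMPLE_RATE = 16_000                    # Whisper's native sample rate (assumption)

def chunk_audio(samples, chunk_s=15):
    """Split a mono waveform (any sequence of samples) into fixed-size chunks.
    The final chunk is left short rather than zero-padded to 30 s, matching
    the release's claim of padding-free 10/15/20/30 s inputs."""
    if chunk_s not in SUPPORTED_CHUNKS_S:
        raise ValueError(f"chunk_s must be one of {SUPPORTED_CHUNKS_S}")
    step = chunk_s * SAMPLE_RATE
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 35 s of silence splits into two full 15 s chunks and one 5 s tail
audio = [0.0] * (35 * SAMPLE_RATE)
print([len(c) / SAMPLE_RATE for c in chunk_audio(audio, chunk_s=15)])
```

Avoiding the zero-padding step matters for streaming, since padding every short segment to 30 s wastes compute and adds latency at chunk boundaries.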
Technically, TheWhisper ships optimized engines for NVIDIA GPUs (via TheStage AI ElasticModels) and CoreML for macOS/Apple Silicon (~2 W power draw and ~2 GB RAM on-device), with reported throughput of ~220 tok/s on an L40S for whisper-large-v3. Streaming is supported on macOS and NVIDIA; word-level timestamps are available in the Apple builds. Benchmarks show ASR quality comparable to OpenAI's Whisper (mean WERs of ~7.3–7.9 across variants), so the gains are largely in latency, power, and deployment flexibility rather than accuracy. Supported stack: RTX 4090/L40S GPUs, Ubuntu 20.04+, CUDA ≥ 11.8, NVIDIA drivers ≥ 520, Python 3.10–3.12, and macOS/iOS on modern M-series chips (e.g., M4). TheStage's optimized engines are free for small organizations at up to 4 GPUs per year; larger deployments require commercial licensing.