Show HN: Lightning-Fast Diarization on Apple Silicon (github.com)

🤖 AI Summary
Senko is an open-source, high-performance speaker diarization pipeline, forked from 3D-Speaker, that emphasizes speed on both NVIDIA GPUs and Apple Silicon. The project claims to process 1 hour of audio in 5 seconds on an RTX 4090 + Ryzen 9 7950X (~17× faster than Pyannote 3.1) and in 7.7 seconds on an Apple M3 (~42× faster). Reported diarization error rates (DER) are competitive: 13.5% on VoxConverse, 13.3% on AISHELL-4, and 26.5% on AMI-IHM. Senko ships as a pip package (with variants for CUDA and Apple), supports Linux/macOS/WSL and Python 3.11.13, and integrates with the Zanshin player for visualizing diarization output.

Technically, Senko keeps a four-stage pipeline: VAD, C++ multithreaded fbank extraction, batched CAM++ speaker-embedding inference, and clustering (spectral or UMAP+HDBSCAN). Its optimizations include a choice of VAD (Pyannote seg-3.0 or Silero), upfront C++ feature extraction, batched GPU inference, and optional RAPIDS GPU clustering on supported NVIDIA cards. On macOS, the VAD and embedding models run through CoreML/ANE, while fbank extraction and clustering stay on the CPU.

The author highlights a practical CPU–GPU orchestration bottleneck: the CPU handles batching and padding, so pairing a fast CPU with a powerful GPU yields the best results. Current limitations: no overlapping-speaker output yet, best results on English/Mandarin and high-fidelity audio, and potential mis-clustering of near-identical or multi-mic voices.
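
To make the CPU-side batching and padding step more concrete, here is a minimal sketch in plain Python. It is not Senko's code: the function name `pad_batches`, the list-of-lists feature layout, and the length-sorting strategy are all assumptions chosen for illustration. The idea is that variable-length fbank segments have to be grouped and padded to a common length on the CPU before a batched embedding model can consume them, which is why CPU speed can limit overall throughput.

```python
from typing import List, Tuple

Frame = List[float]    # one fbank frame (hypothetical layout: list of filterbank energies)
Segment = List[Frame]  # a variable-length sequence of frames


def pad_batches(
    segments: List[Segment], batch_size: int, pad_value: float = 0.0
) -> List[Tuple[List[int], List[Segment], List[int]]]:
    """Group non-empty segments into length-sorted batches and pad each batch.

    Returns a list of (original_indices, padded_segments, true_lengths).
    Sorting by length first keeps segments of similar size together,
    which reduces the amount of padding per batch.
    """
    order = sorted(range(len(segments)), key=lambda i: len(segments[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        lengths = [len(segments[i]) for i in idx]
        max_len = max(lengths)
        n_feats = len(segments[idx[0]][0])  # assumes every frame has the same width
        padded = []
        for i in idx:
            frames = [list(f) for f in segments[i]]
            # Append all-pad frames until this segment reaches the batch maximum.
            frames += [[pad_value] * n_feats for _ in range(max_len - len(frames))]
            padded.append(frames)
        batches.append((idx, padded, lengths))
    return batches


# Three toy segments with 3, 1, and 2 frames of 2 features each.
segments = [
    [[1.0, 1.0]] * 3,
    [[2.0, 2.0]] * 1,
    [[3.0, 3.0]] * 2,
]
batches = pad_batches(segments, batch_size=2)
```

With `batch_size=2`, the two shortest segments share a batch padded to 2 frames, and the 3-frame segment gets its own batch. The returned indices and true lengths let a caller map embeddings back to the original segment order and ignore padded frames.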