🤖 AI Summary
The author revisits a prior embedding-heavy solution (DenseClus) for clustering mixed tabular data and concedes it was unnecessarily complex: multiple UMAP runs are stochastic, require extensive preprocessing and hyperparameter tuning, and add runtime overhead. They advocate starting with Gower distance—a deterministic, interpretable, zero‑hyperparameter metric designed for mixed numeric/categorical data—before resorting to embeddings. Gower’s tradeoff is clear: it’s O(N² × F) to build the full distance matrix (so memory and compute scale quadratically), but it yields reproducible results, easier debugging, and faster iteration because you skip embedding steps.
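The Gower metric described above is simple enough to sketch directly: numeric features contribute a range-normalized absolute difference, categorical features contribute a 0/1 mismatch, and the per-feature scores are averaged. The following is a minimal NumPy illustration of the definition (not the author's library); the example values and column split are invented for demonstration, and you can see the O(N² × F) cost in the broadcasted pairwise arrays.

```python
import numpy as np

def gower_matrix(num, cat):
    """Pairwise Gower distances.

    num: (N, Fn) float array of numeric features
    cat: (N, Fc) string/object array of categorical features
    """
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid divide-by-zero on constant columns
    # Numeric part: range-normalized |x - y| per pair -> (N, N, Fn)
    num_d = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    # Categorical part: simple matching, 0 if equal else 1 -> (N, N, Fc)
    cat_d = (cat[:, None, :] != cat[None, :, :]).astype(float)
    # Gower distance = average contribution over all features
    return np.concatenate([num_d, cat_d], axis=2).mean(axis=2)

# Toy mixed-type data: two numeric and two categorical columns
num = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
cat = np.array([["a", "x"], ["a", "y"], ["b", "x"]])
D = gower_matrix(num, cat)
print(np.round(D, 3))
```

Note how the result is deterministic with no hyperparameters: rerunning on the same data always yields the same matrix, which is exactly the reproducibility argument the author makes against stochastic UMAP embeddings.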
To make Gower practical at scale, the author open-sourced Gower Express (`pip install gower_exp[gpu,sklearn]`), an optimized implementation that's ~20% faster and uses ~40% less memory via Numba JIT, optional GPU/CuPy acceleration, scikit-learn compatibility, automatic feature‑type detection, and missing‑value handling. It also provides a top‑N heap routine to avoid full quadratic work when you only need nearest neighbors. Example timings: 100K rows ≈ 45s CPU / 12s GPU (1.2GB); 1M rows ≈ 18min CPU / 3.8min GPU (8GB). The takeaway: for most mixed‑type clustering tasks, try Gower first for reproducibility and interpretability; only move to embedding approaches when you hit specific limitations.
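The top‑N idea can be sketched as follows: when you only need each record's nearest neighbors, compute one query row's distances in O(N × F) and keep the N smallest with a heap, rather than materializing the full O(N²) matrix. This is a hedged illustration of the technique, not Gower Express's actual implementation; the function names and data are invented for the example.

```python
import heapq
import numpy as np

def gower_row(query_num, query_cat, num, cat, ranges):
    """Gower distances from one record to every row, O(N * F) time and memory."""
    num_d = np.abs(num - query_num) / ranges          # (N, Fn)
    cat_d = (cat != query_cat).astype(float)          # (N, Fc)
    return np.concatenate([num_d, cat_d], axis=1).mean(axis=1)

def gower_topn(i, num, cat, n=2):
    """Indices of the n nearest neighbors of row i, without the full matrix."""
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0
    d = gower_row(num[i], cat[i], num, cat, ranges)
    d[i] = np.inf                                     # exclude the query itself
    # The heap keeps only the n smallest distances instead of sorting all N
    return heapq.nsmallest(n, range(len(d)), key=d.__getitem__)

num = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0], [1.1, 11.0]])
cat = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["a", "x"]])
nearest = gower_topn(0, num, cat, n=2)
print(nearest)
```

The same streaming pattern is what makes million-row workloads feasible: memory stays linear per query, and the per-row distance computation is exactly the part that JIT compilation or a GPU can accelerate.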