Thinking Machines – Modular Manifolds (thinkingmachines.ai)

🤖 AI Summary
Researchers propose constraining neural network weight matrices to geometric submanifolds and co-designing optimizers that operate on those manifolds. A concrete instantiation is "manifold Muon," which constrains weights to the Stiefel manifold (matrices with all singular values equal to 1).

The motivation is practical: keeping weights, activations, and gradients on predictable scales reduces exploding/vanishing behavior, simplifies hyperparameter tuning, enforces small condition numbers (better numerical predictability and Lipschitz guarantees), and can improve robustness. The authors frame this as a "modular manifolds" approach that makes it easier to compose per-layer constraints for large-scale models.

Technically, the post walks through manifold optimization from a hypersphere warmup to matrix-valued parameters. Manifold optimizers take steps in the tangent space, choose a Riemannian distance (metric) that changes the optimal update direction, and then retract back to the manifold. For the hypersphere, the optimal tangent update is a_opt = −η (g − w w^T g) / ||g − w w^T g||, and the retraction rescales by 1/√(1 + η²).

For matrices, the Stiefel manifold is defined by W^T W = I, with tangent condition A^T W + W^T A = 0; using the spectral norm as the distance yields updates that control the maximal stretching of inputs. Manifold Muon generalizes spectral-normalized Muon to Stiefel-constrained weights, and the post highlights many open directions for integrating manifold constraints into large-model training.
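The hypersphere update described above can be sketched in a few lines of NumPy. This is a minimal illustration of the stated formulas only; the function name, the step-size default, and the choice of NumPy are assumptions, not from the post:

```python
import numpy as np

def hypersphere_step(w, g, eta=0.1):
    """One manifold-optimizer step on the unit hypersphere.

    w : current parameter vector, assumed to have unit norm
    g : ambient (Euclidean) gradient at w
    eta : step size (hypothetical default)
    """
    # Project the gradient onto the tangent space at w: g - w w^T g
    g_t = g - w * (w @ g)
    # Optimal tangent step of length eta: a_opt = -eta * g_t / ||g_t||
    a = -eta * g_t / np.linalg.norm(g_t)
    # Retraction: since a is orthogonal to w and ||a|| = eta,
    # ||w + a|| = sqrt(1 + eta^2), so rescaling restores unit norm.
    return (w + a) / np.sqrt(1.0 + eta**2)
```

Because the tangent step is orthogonal to w, the rescaling factor 1/√(1 + η²) is exact rather than approximate, which is why the post can give it in closed form.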
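For the Stiefel case, the standard Riemannian ingredients can be sketched as follows: projecting an ambient gradient onto the tangent space (which enforces A^T W + W^T A = 0) and retracting via the polar factor of the SVD (which resets all singular values to 1). This is a generic textbook sketch, not the post's manifold Muon update, which instead solves for the spectral-norm-optimal step:

```python
import numpy as np

def stiefel_tangent_project(W, G):
    """Project ambient gradient G onto the tangent space at W,
    where W has orthonormal columns (W^T W = I).

    The result A satisfies the tangent condition A^T W + W^T A = 0.
    """
    M = W.T @ G
    # Subtract the symmetric part of W^T G along W
    return G - W @ ((M + M.T) / 2.0)

def polar_retract(W):
    """Retract a matrix onto the Stiefel manifold by snapping all
    singular values to 1 (the polar factor U V^T of the SVD)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```

The polar retraction makes the "all singular values = 1" constraint concrete: whatever stretching the unconstrained step introduced is removed, so the layer's maximal input stretching stays controlled.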