Stochastic Activations (gonzoml.substack.com)

🤖 AI Summary
Researchers propose "stochastic activations" as a practical middle ground between high-quality smooth activations (SiLU/Swish, SwiGLU) and sparse but non-smooth ReLU. They introduce two methods: Swi+FT (activation fine-tuning), which trains most of the way with a smooth activation (SiLU) and then switches to ReLU for the final 5-20% of steps under cosine LR decay, and StochA, which samples between two activations (e.g., SiLU vs. ReLU) via a Bernoulli(p) draw during training or inference.

Experiments on 1.5B and 3B decoder LLMs (pre-norm RMSNorm, RoPE, AdamW, Llama 3 tokenizer, 8k context, pretraining on 47B-80B tokens) show SiLU still yields the lowest loss and best downstream quality; the stochastic schemes beat plain ReLU but typically land between ReLU and SiLU. Switching activations causes a transient loss spike that recovers quickly.

The key practical payoff is efficiency: using ReLU at inference yields >90% activation sparsity in FFN layers, translating to a ~65% CPU inference speedup in their tests (GPU gains need further engineering or tensor-core sparsity support). StochA also enables test-time diversity, i.e., multiple sampled outputs per prompt. Important open points include implementation details for sparse FFNs, memory-access vs. sparse-matmul tradeoffs, and how best to convert dense-trained weights to sparse inference formats. The work highlights a concrete quality-efficiency tradeoff and suggests that more systematic comparisons of modern activations, along with sparse-inference engineering, are timely.
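To make the Swi+FT recipe concrete, here is a minimal schedule sketch in Python. It is not from the post: the helper names (`activation_for_step`, `relu_frac`) and the exact cosine formula are illustrative assumptions; the grounded details are only the SiLU-to-ReLU switch in the final 5-20% of steps and the cosine LR decay.

```python
import math

def activation_for_step(step: int, total_steps: int, relu_frac: float = 0.10) -> str:
    """Swi+FT-style schedule sketch: smooth activation (SiLU) for most of
    training, ReLU for the final `relu_frac` of steps (5-20% in the summary)."""
    return "relu" if step >= (1.0 - relu_frac) * total_steps else "silu"

def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Standard cosine learning-rate decay over the full run (assumed form)."""
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: a 100k-step run where the last 10% of steps use ReLU.
for step in (0, 89_999, 90_000):
    print(step, activation_for_step(step, 100_000), round(cosine_lr(step, 100_000, 3e-4), 6))
```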
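StochA itself is just a coin flip between two nonlinearities. Below is a minimal PyTorch sketch, assuming the Bernoulli(p) draw can be made either once per forward pass or independently per activation value; the summary does not specify the granularity, so both variants are shown as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticActivation(nn.Module):
    """StochA-style activation sketch: choose SiLU with probability p_silu,
    otherwise ReLU. Sampling granularity is an assumption, not from the post."""

    def __init__(self, p_silu: float = 0.5, per_element: bool = False):
        super().__init__()
        self.p_silu = p_silu          # probability of picking SiLU for a draw
        self.per_element = per_element

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.per_element:
            # Independent Bernoulli(p) draw for each activation value.
            mask = torch.rand_like(x) < self.p_silu
            return torch.where(mask, F.silu(x), F.relu(x))
        # Single Bernoulli(p) draw shared by the whole forward pass.
        use_silu = torch.rand(()) < self.p_silu
        return F.silu(x) if use_silu else F.relu(x)

# Usage: drop in place of a fixed FFN nonlinearity; at test time, resampling
# per generation gives the output diversity mentioned in the summary.
ffn_act = StochasticActivation(p_silu=0.9)
h = ffn_act(torch.randn(4, 8192))
```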
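The >90% FFN sparsity figure is straightforward to sanity-check in a dense setting: after the ReLU, count exact zeros in the hidden activations. A hedged diagnostic sketch follows; the variable names and the toy weights are mine, not the post's measurement code.

```python
import torch

@torch.no_grad()
def relu_ffn_sparsity(x: torch.Tensor, w_in: torch.Tensor) -> float:
    """Fraction of exact zeros in ReLU FFN hidden activations (diagnostic only)."""
    h = torch.relu(x @ w_in)            # FFN up-projection followed by ReLU
    return (h == 0).float().mean().item()

# With random weights this sits near 0.5; the post reports >0.9 for models
# trained or fine-tuned with ReLU, which is what a sparse FFN kernel would exploit.
x = torch.randn(16, 2048)
w_in = torch.randn(2048, 8192) / 2048 ** 0.5
print(relu_ffn_sparsity(x, w_in))
```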