🤖 AI Summary
Activation functions are the “gatekeepers” inside neural nets that introduce the non-linearity required to learn complex patterns; without them, stacked layers collapse to a single linear transform. The field has evolved from Sigmoid/Tanh (which introduced non-linearity but caused vanishing gradients in deep nets) to ReLU (max(0,x)), which largely mitigated vanishing gradients by giving positive inputs a unit derivative but introduced the “dying ReLU” problem, where a neuron stuck in the negative region outputs zero (and receives zero gradient) permanently. To mitigate that, smoother activations like GELU (a probabilistic smoothing of negative inputs) and Swish (x * sigmoid(x)) were adopted for more stable gradients and nuanced responses.
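The progression above can be sketched as a few scalar functions; this is a minimal illustration (standard definitions, not tied to any particular library), using the exact GELU in terms of the normal CDF:

```python
import math

def sigmoid(x: float) -> float:
    """Classic squashing activation; saturates for large |x|, causing vanishing gradients."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    """max(0, x): unit derivative for x > 0, zero for x <= 0 (source of "dying ReLU")."""
    return max(0.0, x)

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x: float) -> float:
    """Swish (SiLU): x * sigmoid(x), a smooth, non-monotonic ReLU alternative."""
    return x * sigmoid(x)
```

Note how GELU and Swish pass small negative values through with a small (non-zero) weight instead of clipping them to zero, which is what keeps gradients flowing where ReLU would die.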
The current state of the art has shifted from single-path functions to gated mechanisms (GLU variants) that split an input into two projections: one forms a dynamic gate and the other carries information, combined via element-wise multiplication (Activation(xW) ⊗ (xV)). SwiGLU and GEGLU—GLUs that use Swish and GELU as gates—are now common in top Transformers (e.g., LLaMA, PaLM, Gemma), because gating increases feed‑forward expressivity and control over information flow. Practically, this progression shows that activation choice is a critical architectural knob: it affects trainability, stability, and model capacity, and remains an active area for squeezing more performance from large models.
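The gated feed-forward idea can be sketched in a few lines of NumPy; the layer sizes and weight initialization here are toy placeholders (real Transformer blocks use learned weights and much larger dimensions, and biases are often omitted as they are here):

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x: np.ndarray) -> np.ndarray:
    """Swish (SiLU): x * sigmoid(x), used as the gate activation in SwiGLU."""
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x: np.ndarray, W: np.ndarray, V: np.ndarray, W_out: np.ndarray) -> np.ndarray:
    """SwiGLU feed-forward: Swish(xW) elementwise-gates (xV), then projects back down."""
    gate = swish(x @ W)   # dynamic gate path
    value = x @ V         # information-carrying path
    return (gate * value) @ W_out

d_model, d_ff = 8, 16  # toy sizes for illustration only
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
W_out = rng.standard_normal((d_ff, d_model))

x = rng.standard_normal((2, d_model))  # a "batch" of 2 token vectors
y = swiglu_ffn(x, W, V, W_out)         # shape (2, d_model)
```

Swapping `swish` for a GELU in the gate path gives GEGLU; the structural point is the two parallel projections combined by element-wise multiplication, which lets the network learn per-feature, input-dependent gating rather than applying one fixed non-linearity.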