Energy-Based Transformers Are Scalable Learners and Thinkers (alexiglad.github.io)

🤖 AI Summary
Researchers introduce Energy-Based Transformers (EBTs), a new class of models that combine energy-based models (EBMs) with Transformer architectures and a scalable training recipe to enable unsupervised "System 2" thinking across modalities. Instead of directly generating outputs, EBTs learn a verifier, an energy function that scores context–prediction pairs, and produce predictions by iteratively minimizing that energy via gradient descent on the prediction. This yields three cognitive-like capabilities: dynamic computation allocation (thinking longer by running more optimization steps), explicit uncertainty via continuous energy scores, and in-place verification of predictions, all learned without task-specific or human-provided rewards.

Empirically, EBTs outperform a strong feed-forward Transformer++ baseline on language-modeling and vision tasks, improving every next-token/frame prediction and showing stronger out-of-distribution generalization. They also scale more efficiently: up to ~35% faster data scaling (e.g., reaching equivalent perplexity with <20T tokens versus ~30T), better downstream performance at equal pretraining perplexity, and higher scaling rates with respect to width, depth, and parameter count, plus gains on some CV metrics. The authors attribute this to verification being easier than direct generation and to the added flexibility of the optimization-based formulation. Caveats remain around training stability and further scaling work, but EBTs promise a data-efficient, modality-agnostic path to emergent reasoning through unsupervised, verifier-based optimization.
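To make the "verifier plus optimization" idea concrete, here is a minimal, hypothetical PyTorch sketch of energy-minimization inference: start from a random prediction and take a few gradient steps on the prediction itself to lower a learned energy. The `toy_energy` function, the `think` helper, and the `steps`/`lr` parameters are illustrative assumptions for this sketch; in the paper the energy function is a trained Transformer scoring context–prediction pairs, and this is not the authors' code.

```python
# Minimal sketch of verifier-based prediction via energy minimization.
# Assumption: toy_energy is a stand-in for a trained EBT (which would be a
# Transformer scoring context-prediction pairs); think(), steps, and lr are
# illustrative names, not the authors' API.
import torch


def toy_energy(context: torch.Tensor, prediction: torch.Tensor) -> torch.Tensor:
    # Low energy when the prediction is "compatible" with the context.
    # Here compatibility = matching the context's mean (a placeholder rule).
    return ((prediction - context.mean(dim=-1, keepdim=True)) ** 2).sum()


def think(context: torch.Tensor, pred_dim: int, steps: int = 8, lr: float = 0.5):
    """Produce a prediction by gradient descent on the energy surface.
    More steps = more compute allocated to "thinking" about this prediction."""
    pred = torch.randn(context.shape[0], pred_dim, requires_grad=True)
    for _ in range(steps):
        energy = toy_energy(context, pred)
        (grad,) = torch.autograd.grad(energy, pred)  # d(energy)/d(prediction)
        with torch.no_grad():
            pred -= lr * grad  # refine the prediction toward lower energy
    final_energy = toy_energy(context, pred.detach()).item()
    return pred.detach(), final_energy  # final energy doubles as a verification/uncertainty score


if __name__ == "__main__":
    ctx = torch.randn(4, 16)                     # a batch of 4 "contexts"
    pred, e = think(ctx, pred_dim=1, steps=16)   # think longer by raising steps
    print(pred.squeeze(), e)
```

In this sketch, raising `steps` is the "think longer" knob described in the summary, and the final energy is the continuous score that serves as explicit uncertainty and in-place verification.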