🤖 AI Summary
Researchers released CALM (Continuous Autoregressive Language Models), a new paradigm that replaces discrete token-by-token decoding with autoregressive prediction in a continuous latent space: an autoencoder compresses each K-token chunk into a single continuous vector, and a language model predicts those vectors sequentially. In practice the authors demonstrate K=4 (patch_size=4), using an autoencoder (latent_size=128, with a small encoder/decoder) trained on ~15B tokens and a downstream CALM transformer trained with an energy-based loss. The code, training recipes, and pre-trained checkpoints are available on GitHub; reproducing their runs requires substantial storage (~2.5 TB), but the provided scripts and configs for training/eval (bf16, torchrun, block sizes, batch/accumulation settings) make replication straightforward.
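The core mechanism is easy to sketch. Below is a minimal, hypothetical PyTorch illustration of the idea, not the authors' code: the module names, the linear encoder/decoder, and every size except K=4 and latent_size=128 are assumptions. A chunk autoencoder maps every 4 tokens to one 128-dim latent, and a causal transformer runs autoregressively over the 4x-shorter latent sequence.

```python
# Hypothetical sketch of the CALM idea: autoregression over chunk latents
# instead of individual tokens. Names and sizes (other than K=4, LATENT=128)
# are assumptions for illustration only.
import torch
import torch.nn as nn

K, VOCAB, D_MODEL, LATENT = 4, 32000, 512, 128   # patch_size=4, latent_size=128

class ChunkAutoencoder(nn.Module):
    """Compresses each K-token chunk into one continuous vector and back."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.enc = nn.Linear(K * D_MODEL, LATENT)    # K tokens -> 1 latent
        self.dec = nn.Linear(LATENT, K * D_MODEL)    # 1 latent -> K token states
        self.head = nn.Linear(D_MODEL, VOCAB)

    def encode(self, tokens):                        # tokens: (B, T), T % K == 0
        B, T = tokens.shape
        x = self.embed(tokens).view(B, T // K, K * D_MODEL)
        return self.enc(x)                           # (B, T//K, LATENT)

    def decode(self, latents):                       # latents: (B, N, LATENT)
        B, N, _ = latents.shape
        h = self.dec(latents).view(B, N, K, D_MODEL)
        return self.head(h)                          # (B, N, K, VOCAB) logits

class LatentAR(nn.Module):
    """Causal transformer over latents: one step predicts the next chunk's vector."""
    def __init__(self, n_layers=4, n_heads=8):
        super().__init__()
        self.proj_in = nn.Linear(LATENT, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.proj_out = nn.Linear(D_MODEL, LATENT)

    def forward(self, latents):                      # (B, N, LATENT)
        mask = nn.Transformer.generate_square_subsequent_mask(latents.size(1))
        h = self.backbone(self.proj_in(latents), mask=mask)
        return self.proj_out(h)                      # predicted next latent per position

# Shapes only: 16 tokens -> 4 latents -> 4 predicted latents -> 16 decoded tokens.
ae, lm = ChunkAutoencoder(), LatentAR()
tokens = torch.randint(0, VOCAB, (2, 16))
z = ae.encode(tokens)        # (2, 4, 128): 4x fewer autoregressive positions
z_pred = lm(z)               # (2, 4, 128)
logits = ae.decode(z_pred)   # (2, 4, 4, VOCAB)
print(z.shape, z_pred.shape, logits.shape)
```

The efficiency argument falls out of the shapes: the expensive autoregressive backbone runs over T/K positions instead of T, while the cheap autoencoder handles the token-level detail.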
CALM matters because it cuts the number of autoregressive steps by roughly a factor of K, opening a new "semantic bandwidth" scaling axis beyond model size and data: packing more information into each step can improve both training and inference efficiency. The authors provide a likelihood-free toolkit: a high-fidelity autoencoder, energy-based training (preferred over diffusion/flow objectives in their tests), a new BrierLM metric for calibrated likelihood-free evaluation, and a temperature-based black-box sampler for generation. Empirically, a mid-sized CALM-M (371M parameters) attains BrierLM ≈ 5.72 versus ~6.05 for a comparable autoregressive baseline, a quality gain, though larger CALM variants showed mixed scores, highlighting trade-offs in scaling continuous representations.
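The BrierLM metric builds on the classic Brier score, which, unlike perplexity, can be estimated from samples alone. As a rough illustration (function names and the aggregation are assumptions; the paper's exact BrierLM formulation may differ, e.g. in how it combines n-gram orders), the identity Brier = Σ_k p_k² − 2·p_y + 1 admits an unbiased estimate from just two independent model samples per position:

```python
# Minimal sketch of a sample-based Brier estimate (assumed names; not the
# paper's exact BrierLM definition). Needs only samples from the model,
# never its likelihoods.
import torch

def brier_estimate(sample_a, sample_b, reference):
    """Unbiased per-position Brier estimate from two i.i.d. model samples.

    sample_a, sample_b, reference: integer tensors of token ids, same shape.
    Lower is better; 0 means all probability mass sits on the reference token.
    """
    match_ab = (sample_a == sample_b).float()    # estimates sum_k p_k^2
    match_ay = (sample_a == reference).float()   # estimates p_y (first draw)
    match_by = (sample_b == reference).float()   # estimates p_y (second draw)
    return (match_ab - match_ay - match_by + 1.0).mean()

# Toy usage with made-up token ids.
ref = torch.tensor([5, 9, 2, 7])
a = torch.tensor([5, 9, 3, 7])
b = torch.tensor([5, 1, 3, 7])
print(brier_estimate(a, b, ref))
```

Because only samples are required, an evaluation in this style applies directly to a generator like CALM, whose decoded outputs come without per-token probabilities.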