LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures (arxiviq.substack.com)

🤖 AI Summary
LLM-JEPA introduces a hybrid training objective that brings Joint Embedding Predictive Architectures (JEPAs) from vision into the LLM world by combining the standard autoregressive next-token loss with a JEPA embedding-prediction loss. The method operates on pairs of related “views” (e.g., a natural-language description and its code implementation) and trains the model to predict the embedding of one view from the other, using the LLM itself as both encoder and predictor: k special [PRED] predictor tokens appended to the input yield the predicted embedding, and cosine similarity measures its distance to the target embedding. This avoids heavy extra networks and focuses learning on abstract, cross-view semantics rather than raw token reconstruction.

Empirically, LLM-JEPA yields consistent gains across models (Llama3, Gemma2, OpenELM, OLMo) and datasets (NL-RX, GSM8K, Spider), improving both pretraining and fine-tuning performance, accelerating PEFT convergence (LoRA at rank 512 approaches full fine-tuning), and dramatically reducing overfitting. Analyses show cleaner, near-linear alignment between Text and Code embedding clusters, indicating more structured, transferable representations.

The main limitation is compute: the three forward passes (one generative plus one per view) raise training cost roughly 3×. Proposed next steps include single-pass attention masking and data augmentation to generate non-trivial views, which would make the approach viable at scale. Overall, LLM-JEPA opens a new direction for embedding-space self-supervision in language models.
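To make the objective concrete, here is a minimal sketch of the combined loss as described above, assuming a HuggingFace-style causal LM and right-padded batches. Names such as `pred_token_id`, `k_pred`, and `lam` are illustrative placeholders, not identifiers from the paper's code, and details like whether gradients flow through the target view are assumptions rather than confirmed choices.

```python
import torch
import torch.nn.functional as F

def last_token_embedding(model, input_ids, attention_mask):
    """Encode a view with the LLM itself: take the last hidden state
    of the final non-padding token as the view's embedding
    (assumes right padding)."""
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                output_hidden_states=True)
    hidden = out.hidden_states[-1]                 # (B, T, D)
    last_idx = attention_mask.sum(dim=1) - 1       # last real token per row
    return hidden[torch.arange(hidden.size(0)), last_idx]  # (B, D)

def llm_jepa_loss(model, text_ids, text_mask, code_ids, code_mask,
                  labels, pred_token_id, k_pred=1, lam=1.0):
    # 1) Standard autoregressive next-token loss (first forward pass).
    lm_loss = model(input_ids=text_ids, attention_mask=text_mask,
                    labels=labels).loss

    # 2) Predicted embedding: append k [PRED] predictor tokens to one view
    #    and read out the embedding at the final [PRED] position
    #    (second forward pass).
    B = text_ids.size(0)
    pred_tokens = torch.full((B, k_pred), pred_token_id,
                             dtype=text_ids.dtype, device=text_ids.device)
    pred_ids = torch.cat([text_ids, pred_tokens], dim=1)
    pred_mask = torch.cat([text_mask, torch.ones_like(pred_tokens)], dim=1)
    z_pred = last_token_embedding(model, pred_ids, pred_mask)

    # 3) Target embedding: encode the other view with the same LLM,
    #    so no extra target network is needed (third forward pass).
    z_target = last_token_embedding(model, code_ids, code_mask)

    # 4) JEPA term: cosine distance between predicted and target embeddings,
    #    weighted by lam and added to the generative loss.
    jepa_loss = (1.0 - F.cosine_similarity(z_pred, z_target, dim=-1)).mean()
    return lm_loss + lam * jepa_loss
```

The three model calls mirror the "~3× training cost" noted above, which is exactly what the proposed single-pass attention-masking follow-up aims to remove.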