Learning to Model the World with Language (dynalang.github.io)

🤖 AI Summary
Researchers introduced Dynalang, a model-based agent that learns to "model the world with language" by predicting future text and visual observations and using imagined rollouts to plan actions. Built on DreamerV3, Dynalang compresses each timestep's image and text token into a latent, trains to reconstruct observations, predict rewards, and forecast the next latent, and then trains a policy on imagined trajectories.

Crucially, it treats video frames and text tokens as a single multimodal sequence (one image + one token per timestep), enabling language-conditioned prediction, unified language generation, and text-only pretraining without actions or rewards. This approach is significant because it expands language use beyond instruction-following to include descriptions, rules, corrections, and dynamics, grounding diverse language in future prediction.

Dynalang outperforms language-conditioned RL baselines (IMPALA, R2D2) and task-specific models (EMMA) on benchmarks like HomeGrid (language hints), Messenger (multi-hop text reasoning), and Habitat (photorealistic navigation). Practical implications include improved sample efficiency, the ability to leverage large offline text/video corpora (TinyStories pretraining improved downstream RL vs. T5 embeddings), and unified embodied language generation (LangRoom). The work suggests a scalable path for agents that learn general language-grounded world models for both acting and speaking.
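To make the training signal concrete, here is a minimal PyTorch sketch of the idea described above: fuse one image and one text token per timestep into a recurrent latent, train heads to reconstruct observations, predict reward, and forecast the next latent, and roll the model forward in latent space to train a policy. This is not the authors' DreamerV3-based implementation; all module names, sizes, and the simplified deterministic latent are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMultimodalWorldModel(nn.Module):
    """Sketch of a Dynalang-style world model: one image + one token per step."""

    def __init__(self, image_dim=256, vocab_size=1000, action_dim=8, latent_dim=128):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, 64)
        self.image_encoder = nn.Linear(image_dim, 64)              # stand-in for a CNN
        self.fuse = nn.GRUCell(64 + 64 + action_dim, latent_dim)   # recurrent latent state
        self.image_decoder = nn.Linear(latent_dim, image_dim)      # reconstruct the image
        self.token_head = nn.Linear(latent_dim, vocab_size)        # reconstruct / generate text
        self.reward_head = nn.Linear(latent_dim, 1)
        self.next_latent_head = nn.Linear(latent_dim + action_dim, latent_dim)

    def observe(self, latent, image, token, action):
        """Incorporate the current image, text token, and action into the latent."""
        feat = torch.cat([self.image_encoder(image), self.token_embed(token), action], -1)
        return self.fuse(feat, latent)

    def losses(self, latent, image, token, reward, next_latent, action):
        """Reconstruction, reward, and next-latent prediction losses."""
        recon = F.mse_loss(self.image_decoder(latent), image)
        text = F.cross_entropy(self.token_head(latent), token)
        rew = F.mse_loss(self.reward_head(latent).squeeze(-1), reward)
        pred = F.mse_loss(
            self.next_latent_head(torch.cat([latent, action], -1)), next_latent.detach())
        return recon + text + rew + pred

    def imagine(self, latent, policy, horizon=5):
        """Roll forward in latent space with policy actions; the actor-critic
        would be trained on these imagined trajectories."""
        traj = []
        for _ in range(horizon):
            action = policy(latent)
            latent = self.next_latent_head(torch.cat([latent, action], -1))
            traj.append((latent, self.reward_head(latent)))
        return traj


# Minimal usage with dummy data (batch of 4); shapes only, no real training loop.
wm = TinyMultimodalWorldModel()
policy = nn.Sequential(nn.Linear(128, 8), nn.Tanh())
latent = torch.zeros(4, 128)
image, token = torch.randn(4, 256), torch.randint(0, 1000, (4,))
action, reward = torch.randn(4, 8), torch.randn(4)
latent = wm.observe(latent, image, token, action)
loss = wm.losses(latent, image, token, reward, torch.zeros(4, 128), action)
traj = wm.imagine(latent, policy)
```

Note that text-only pretraining falls out of this setup: with actions and rewards absent (e.g. zeroed), the same sequence model can still be trained on the token-prediction term alone over a text corpus.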