TinyWorlds: Reimplemented Genie 3 from Scratch (github.com)

🤖 AI Summary
TinyWorlds is an open-source, minimal reimplementation of DeepMind's Genie world-model architecture that packages a scalable, autoregressive approach to video/dynamics prediction into a compact, readable codebase. It is designed as an educational, experiment-friendly reference for researchers who want to see how state-of-the-art LLM techniques (transformers, discrete tokenization, iterative masked sampling) can be applied to unsupervised world modeling. By turning frames and actions into discrete tokens, TinyWorlds reduces the prediction problem from regressing high-dimensional pixel values to selecting from a ~1000-entry vocabulary, making dynamics modeling and sampling tractable and amenable to LLM-style scaling and tooling.

Technically, the system has three core modules: a video tokenizer (an FSQ VAE using pixel-mixing convolutions plus a Space-Time Transformer that applies spatial attention within each frame and temporal attention across timesteps), an action tokenizer (an FSQ VAE that infers discrete action tokens between frames via masking and auxiliary losses), and an autoregressive dynamics model that predicts next-frame tokens conditioned on past frame and action tokens. Conditioning uses FiLM, normalization uses RMSNorm, and inference uses an iterative MaskGIT-like sampling schedule (exponential k per step).

The repo includes training/inference scripts, Hugging Face assets (PicoDoom, Pong, Zelda, Sonic, etc.), and support for torch.compile, DDP, AMP, and TF32. TinyWorlds is lightweight but extensible (MoE, RoPE/ALiBi, FSDP, optimizer/scheduler experiments), making it a practical playground for advancing and understanding scalable world models.
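The ~1000-entry vocabulary comes from FSQ (Finite Scalar Quantization): each latent dimension is independently rounded to a small fixed set of levels, and the per-dimension indices combine into one token id. A minimal sketch, assuming a level spec like [8, 5, 5, 5] (8·5·5·5 = 1000 codes; the actual levels in the repo may differ):

```python
# Sketch of FSQ: quantize each latent dim to a few levels, then flatten
# the per-dim indices into a single vocabulary token id.
# LEVELS is an assumption for illustration: 8 * 5 * 5 * 5 = 1000 codes.
LEVELS = [8, 5, 5, 5]

def fsq_quantize(z):
    """Round each latent dim (assumed bounded in [-1, 1]) to its nearest level."""
    idxs = []
    for v, n_levels in zip(z, LEVELS):
        # map [-1, 1] -> {0, ..., n_levels - 1}
        i = round((v + 1) / 2 * (n_levels - 1))
        idxs.append(max(0, min(n_levels - 1, i)))
    return idxs

def fsq_token(idxs):
    """Flatten per-dimension indices into one token id (mixed-radix encoding)."""
    token, base = 0, 1
    for i, n_levels in zip(idxs, LEVELS):
        token += i * base
        base *= n_levels
    return token  # an integer in range(1000) for the levels above
```

Because the codebook is implicit in the rounding grid, FSQ avoids the codebook-collapse and commitment-loss machinery of a classic VQ-VAE, which fits the repo's minimalism.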
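The Space-Time Transformer's factorization can be understood through which token pairs are allowed to attend to each other. Indexing tokens by (timestep t, spatial position p), a rough sketch of the two attention patterns (an abstraction for intuition, not the repo's implementation; temporal attention is assumed causal):

```python
# Factorized space-time attention as allowed (query, key) index pairs.
# A tuple (t, p, s, q) means the token at (time t, position p) may attend
# to the token at (time s, position q).

def spatial_pairs(T, P):
    """Spatial attention: each token attends to all tokens in the SAME frame."""
    return {(t, p, t, q) for t in range(T) for p in range(P) for q in range(P)}

def temporal_pairs(T, P):
    """Temporal attention (assumed causal): same spatial position,
    earlier-or-equal timesteps only."""
    return {(t, p, s, p) for t in range(T) for p in range(P) for s in range(t + 1)}
```

Alternating these two patterns keeps cost at roughly T·P² + P·T² attention pairs per layer pair instead of the (T·P)² of full space-time attention.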
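FiLM conditioning means the conditioning signal (here, the action tokens) produces a per-channel scale gamma and shift beta that modulate the features: y_i = gamma_i · x_i + beta_i. A minimal sketch (function names hypothetical):

```python
def film(x, gamma, beta):
    """FiLM: feature-wise affine modulation, y_i = gamma_i * x_i + beta_i.
    gamma and beta would come from a small projection of the action embedding."""
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]
```

Note that gamma = 1, beta = 0 recovers the identity, which is why FiLM layers are typically initialized there so conditioning is learned gradually.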
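The MaskGIT-like inference loop starts from a fully masked token grid and, over a few steps, commits the most confident predictions while re-masking the rest. A sketch of the "exponential k per step" schedule and the fill loop (the `base` parameter and the `predict` interface are assumptions for illustration):

```python
import math

def exponential_schedule(n_tokens, steps, base=2.0):
    """How many tokens to reveal at each step. The cumulative count grows
    exponentially, so early steps commit few tokens and later steps many.
    (`base` is an assumed knob, not taken from the repo.)"""
    ks, done = [], 0
    for s in range(1, steps + 1):
        cum = round(n_tokens * (base ** s - 1) / (base ** steps - 1))
        ks.append(cum - done)
        done = cum
    return ks  # sums to n_tokens

def maskgit_sample(n_tokens, steps, predict):
    """Iteratively fill a fully masked sequence. `predict(tokens)` stands in
    for the dynamics model: it returns a (best_token, confidence) pair for
    every position, given the partially filled sequence (None = masked)."""
    tokens = [None] * n_tokens
    for k in exponential_schedule(n_tokens, steps):
        preds = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is None]
        # keep the k most confident predictions, leave the rest masked
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    return tokens
```

Each step conditions on everything committed so far, so the frame's tokens are produced in a handful of parallel passes instead of one slow position-by-position sweep.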