🤖 AI Summary
The researcher adapted Tiny Recursive Models (TRM), a hierarchical/recurrent transformer design, to autoregressive language modeling and trained ~1M-parameter variants on the TinyStories dataset to test whether hierarchical reasoning architectures can generate coherent text at extremely small scales. Experiments ran on a MacBook Pro (reproducible locally) and compared a standard transformer baseline, a dense TRM (wider FFN plus recursion), a sparse TRM using MoEUT-style recursion, and carry-refinement variants. All runs used an identical training schedule (~2.1M sequences, sequence length 512, 31,250 steps, lr=0.01, batch size 64). Sparse models used 8 FFN experts (k=2 active) and 2 attention experts (k=1 active); the recursed block used two layers, following prior MoEUT findings.
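To make the recursion idea concrete, here is a minimal sketch, assuming a PyTorch implementation; the class name, dimensions, and recursion count are illustrative, not the author's code. A small two-layer block has its weights reused across recursion steps, so effective depth grows without adding parameters.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Two shared transformer layers applied repeatedly (TRM-style recursion)."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256, n_layers=2, n_recursions=3):
        super().__init__()
        # Only `n_layers` distinct layers exist; they are reused on every recursion step.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=d_ff,
                batch_first=True, norm_first=True,
            )
            for _ in range(n_layers)
        )
        self.n_recursions = n_recursions

    def forward(self, x, attn_mask=None):  # x: (batch, seq, d_model)
        for _ in range(self.n_recursions):
            for layer in self.layers:
                x = layer(x, src_mask=attn_mask)
        return x

# Usage: a causal mask for autoregressive generation.
x = torch.randn(2, 16, 64)
mask = nn.Transformer.generate_square_subsequent_mask(16)
out = RecursiveBlock()(x, attn_mask=mask)
print(out.shape)  # torch.Size([2, 16, 64])
```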
Key findings: hierarchical/recurrent recursion can produce coherent story text at 1M parameters, but the added architectural complexity did not beat a plain transformer baseline. The baseline achieved the lowest training loss, with the dense TRM and MoE variants tied closely behind. Carry refinement and deeper recursion added compute without improving loss or generation quality at this scale. MoEUT-style routing was stable but required a dense-projection workaround on Apple Silicon because of MPS sparse-op issues, which inflated compute. Implication: HRM/TRM ideas are viable for tiny LMs, but their advantages on puzzle reasoning do not automatically transfer to simple narrative generation; potential gains are more likely at larger scales or on reasoning-heavy language tasks (e.g., GSM8K), and efficient sparse-op hardware support matters for MoE variants.
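The dense-projection workaround can be pictured roughly as in the sketch below (PyTorch assumed; all names and sizes are illustrative, not the author's implementation): every expert runs on every token, and the outputs are mixed with routing weights that are zero outside the top-k. This avoids the sparse scatter/gather dispatch that can misbehave on the MPS backend, at the cost of computing all experts densely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseRoutedFFN(nn.Module):
    """Top-k MoE FFN computed with only dense ops (MPS-friendly, compute-heavy)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        logits = self.router(x)                    # (batch, seq, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        # Dense routing weights: softmax over the selected experts, zeros elsewhere,
        # built with one_hot instead of sparse dispatch.
        onehot = F.one_hot(topk_idx, num_classes=logits.size(-1)).to(x.dtype)  # (b, s, k, E)
        weights = (topk_vals.softmax(dim=-1).unsqueeze(-1) * onehot).sum(dim=-2)  # (b, s, E)
        # Run every expert on every token (the "dense projection" part).
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (b, s, d, E)
        return (expert_outs * weights.unsqueeze(-2)).sum(dim=-1)         # (b, s, d)

# Usage
x = torch.randn(2, 16, 64)
print(DenseRoutedFFN()(x).shape)  # torch.Size([2, 16, 64])
```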
        