🤖 AI Summary
Researchers from Physical Intelligence introduce "knowledge insulation," a training recipe for vision-language-action (VLA) models that preserves pretrained vision-language model (VLM) knowledge while enabling fast continuous control. They show that naively attaching continuous-output modules (diffusion or flow-matching "action experts") to a pretrained VLM slows training and degrades language understanding, because gradients from the randomly initialized continuous heads interfere with the backbone. Instead, they co-train the backbone with next-token prediction on discretized action tokens (leveraging standard VLM objectives and web-scale vision-language data) while simultaneously training a separate continuous action expert with flow-matching/diffusion losses; crucially, a stop-gradient blocks gradients from that expert into the backbone.
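To make the recipe concrete, here is a minimal PyTorch-style sketch of one training step, with hypothetical module and batch-field names (not the authors' implementation): only the discrete next-token loss updates the backbone, while the flow-matching loss trains the action expert on detached features.

```python
# Minimal sketch of a knowledge-insulated training step.
# `vlm_backbone`, `action_expert`, and the batch fields are hypothetical names.
import torch
import torch.nn.functional as F

def training_step(vlm_backbone, action_expert, batch):
    # The backbone sees images, language, and discretized action tokens and is
    # trained with the standard next-token (cross-entropy) objective, so it can
    # be co-trained on web-scale vision-language data as usual.
    out = vlm_backbone(batch["images"], batch["text"], batch["action_tokens"])
    ce_loss = F.cross_entropy(
        out.logits.flatten(0, 1),                 # (batch * seq, vocab)
        batch["action_token_targets"].flatten()
    )

    # Stop-gradient "insulates" the backbone: the action expert reads the
    # backbone's features, but its flow-matching loss never propagates back
    # into the pretrained VLM weights.
    insulated_features = out.hidden_states.detach()
    flow_loss = action_expert.flow_matching_loss(
        insulated_features, batch["continuous_actions"]
    )

    # Both losses are optimized jointly; only ce_loss reaches the backbone.
    return ce_loss + flow_loss
```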
This "insulation" yields three practical wins: much faster and more stable fine-tuning, preserved transfer of semantic vision-language knowledge, and a small, fast action expert that produces high-frequency continuous commands for dexterous robot control at inference time. Experiments built on the π architecture, spanning long-horizon manipulation, mobile bimanual tasks, and benchmarks such as DROID and LIBERO, along with ablations, demonstrate that both ingredients are crucial: discrete action supervision for representation learning and an insulated continuous expert for execution. The method offers a general recipe for integrating new continuous modalities into large pretrained models without catastrophic interference, improving the scalability and real-time applicability of multimodal robotic systems.
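At inference, the expensive backbone can be queried once per action chunk while the lightweight expert iteratively integrates a learned flow from noise to continuous actions. A hedged sketch of that loop, again with hypothetical names and a simple Euler integrator:

```python
# Hypothetical inference loop: the small action expert does the iterative
# flow integration, keeping high-frequency continuous control cheap.
import torch

@torch.no_grad()
def sample_action_chunk(vlm_backbone, action_expert, obs, num_steps=10):
    feats = vlm_backbone(obs["images"], obs["text"]).hidden_states
    actions = torch.randn(action_expert.chunk_shape)   # start from noise
    for step in range(num_steps):
        t = torch.tensor(step / num_steps)              # flow time in [0, 1)
        velocity = action_expert(actions, t, feats)     # predicted vector field
        actions = actions + velocity / num_steps        # Euler step along the flow
    return actions  # continuous commands sent to the low-level controller
```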