🤖 AI Summary
Researchers introduced CombatVLA, an efficient Vision-Language-Action (VLA) model designed for real-time combat tasks in 3D action role‑playing games. CombatVLA is a 3B-parameter model trained on video–action pairs collected with a custom action tracker; training data are formatted as “action‑of‑thought” (AoT) sequences that explicitly link perceptual inputs to tactical action decisions. The model is deployed inside an action execution framework and uses a truncated AoT inference strategy to cut latency, enabling a reported 50× speedup in in‑game combat while outperforming prior models on a combat understanding benchmark and achieving a higher task success rate than human players.
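The paper's exact serialization isn't given in this summary, but an AoT training record plausibly pairs sampled frames with a brief tactical rationale and a grounded action. The sketch below is a hypothetical schema for illustration; the field names (`frames`, `thought`, `action`) are assumptions, not CombatVLA's released format.

```python
from dataclasses import dataclass


@dataclass
class AoTSample:
    """Hypothetical action-of-thought (AoT) training record.

    Field names are illustrative assumptions, not the dataset's
    released schema.
    """
    frames: list[str]  # sampled video frames (the perceptual input)
    thought: str       # short tactical reasoning, analogous to chain-of-thought
    action: dict       # the concrete action the model should emit


sample = AoTSample(
    frames=["clip_0412/frame_00.png", "clip_0412/frame_01.png"],
    thought="Boss is winding up a heavy attack; dodging left avoids the arc.",
    action={"type": "dodge", "direction": "left", "duration_ms": 300},
)
```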
This work is significant because it tackles three core VLA challenges—sub‑second decision-making, high‑resolution perception, and dynamic tactical reasoning—using a compact, practical model rather than massive multimodal LLMs. Key technical contributions are the AoT data representation (analogous to chain‑of‑thought but for actions), the action tracker for scalable video–action collection, and the truncated AoT method that trades off foresight for real‑time responsiveness. The team plans to open‑source the tracker, dataset, benchmark, weights, and code, which could accelerate reproducible research and transfer to other embodied AI domains such as robotics, autonomous agents, and interactive game AI.
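To make the truncation idea concrete: the sketch below shows one way early-exit decoding could cut latency, stopping generation as soon as a complete action span appears instead of decoding the full reasoning trace. The `model.next_token`/`model.detokenize` interface and the `<action>` tags are assumptions for illustration, not the paper's actual API.

```python
def truncated_aot_decode(model, prompt_tokens,
                         action_start="<action>", action_end="</action>",
                         max_new=128):
    """Sketch of truncated AoT inference under assumed interfaces.

    Streams tokens and returns as soon as a complete action span has
    been emitted, skipping any trailing reasoning tokens that would
    add latency without changing the executed action.
    """
    generated = []
    text = ""
    for _ in range(max_new):
        # Greedy next-token step (hypothetical model API).
        tok = model.next_token(prompt_tokens + generated)
        generated.append(tok)
        text += model.detokenize([tok])
        # Early exit: the action is fully specified, so stop decoding.
        if action_start in text and action_end in text:
            break
    start = text.index(action_start) + len(action_start)
    end = text.index(action_end)
    return text[start:end].strip()
```

This is the trade-off the summary describes: decoding less of the thought sacrifices some foresight in exchange for sub-second responsiveness during combat.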