🤖 AI Summary
EnvTrace is a new evaluation framework that judges LLM-generated control code by running it in a high-fidelity simulator and aligning the resulting execution traces against ground-truth traces to measure semantic equivalence. Unlike conventional stateless unit tests, EnvTrace captures stateful behaviors of physical systems (sequenced actions, state transitions and safety-relevant responses) and produces a multi‑faceted score for functional correctness. The authors demonstrate the method using a digital twin of synchrotron beamline control logic, which not only lets them benchmark models but also enables pre‑execution validation of live experiments for safer deployment.
Technically, EnvTrace compares the temporal series of commands and state changes produced by an LLM's code against the expected traces, extracting behavioral metrics that go beyond syntax checks or single-step outputs. Applied to over 30 LLMs, the approach shows that many top-tier models approach human-level performance for rapid control-code generation, while revealing nuanced failure modes that unit tests miss. The work points to a practical symbiosis: LLMs supplying intuitive control and orchestration, and digital twins providing safe, high-fidelity environments for validation — a step toward autonomous embodied AI in scientific instrumentation.
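The paper's exact scoring procedure isn't given in this summary, but the core idea (order-aware alignment of a generated execution trace against a ground-truth trace, plus separate checks on safety-relevant events) can be sketched as follows. Everything here is a hypothetical illustration: the event names, the `SequenceMatcher`-based alignment, and the `safety:` tag convention are assumptions, not EnvTrace's actual metric.

```python
from difflib import SequenceMatcher

def trace_similarity(generated, ground_truth):
    """Toy stand-in for trace alignment: each trace is a list of
    (command, state) events emitted by the simulator.

    Returns an order-aware alignment ratio plus a boolean indicating
    whether every safety-tagged reference event appears, in order,
    in the generated trace. This is an illustrative sketch, not the
    authors' multi-faceted score.
    """
    # SequenceMatcher works on any sequence of hashable items,
    # so tuples of (command, state) are fine.
    ratio = SequenceMatcher(None, generated, ground_truth).ratio()

    # Hypothetical safety check: safety-relevant events (here marked
    # with a "safety:" prefix) must occur as an ordered subsequence.
    safety_events = [e for e in ground_truth if e[0].startswith("safety:")]
    it = iter(generated)
    safety_ok = all(any(e == g for g in it) for e in safety_events)

    return {"alignment": ratio, "safety_ok": safety_ok}

# Example with invented beamline-style events:
gt = [("open_shutter", "ready"),
      ("safety:check_interlock", "ok"),
      ("scan", "running")]
gen = [("open_shutter", "ready"),
       ("safety:check_interlock", "ok"),
       ("scan", "running")]
print(trace_similarity(gen, gt))  # identical traces → alignment 1.0, safety_ok True
```

A real system would combine several such facets (ordering, timing, state coverage, safety responses) into the multi-faceted functional-correctness score the summary describes.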