🤖 AI Summary
Researchers analyzed four state-of-the-art reference-based code evaluation metrics (CEMs) and found they are heavily biased toward surface-level similarity (e.g., token or syntactic resemblance) rather than true functional equivalence. To probe this gap they introduced LoCaL (Looks Can Lie), a targeted benchmark of 3,117 code pairs at both method and program granularity, built to include pairs that look similar but differ functionally as well as pairs that look different but behave the same. Instead of relying on expensive hand-written test suites, the authors compute functional similarity via differential fuzzing, automatically generating and executing far more tests than prior work and producing reliable, execution-driven labels without predefined test cases.
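The paper's actual fuzzing pipeline is not reproduced in this summary; the snippet below is a minimal illustrative sketch of the general differential-fuzzing idea, assuming single-argument integer functions. The `functional_similarity` helper and the `ref`/`cand` pair are hypothetical examples, not taken from LoCaL.

```python
import random

def functional_similarity(func_a, func_b, num_trials=1000, seed=0):
    """Estimate functional similarity by differential fuzzing:
    run both implementations on the same random inputs and measure
    how often their observable behavior agrees."""
    rng = random.Random(seed)
    agreements = 0
    for _ in range(num_trials):
        x = rng.randint(-10**6, 10**6)  # random test input (assumes an int-valued API)
        if _observe(func_a, x) == _observe(func_b, x):
            agreements += 1
    return agreements / num_trials

def _observe(func, x):
    """Capture the output or the exception type, so crashes also count as behavior."""
    try:
        return ("ok", func(x))
    except Exception as exc:
        return ("error", type(exc).__name__)

# Example: a surface-similar but functionally different pair.
def ref(n):
    return n * (n + 1) // 2   # sum of 1..n

def cand(n):
    return n * (n - 1) // 2   # off-by-one variant that looks nearly identical

print(functional_similarity(ref, cand))  # low score despite high token overlap
```

A score near 1.0 indicates behavioral agreement across the fuzzed inputs regardless of how similar the two snippets look, which is the kind of execution-driven label the benchmark relies on instead of surface resemblance.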
On LoCaL, all four evaluated CEMs show marked performance drops relative to prior benchmarks, confirming that current reference-based metrics can be misled by superficial code resemblance and overestimate correctness. The work highlights a practical evaluation blind spot in LLM-driven code generation research: cheaper reference-based scoring can fail to capture semantics unless benchmarks explicitly include such deceptive pairs. The authors suggest that exposing CEMs to LoCaL-like data during development or training could yield metrics that are robust to surface bias, improving the fidelity of automated code evaluation for AI-generated programs.