🤖 AI Summary
Researchers analyzed four state-of-the-art reference-based code evaluation metrics (CEMs) and found they are heavily biased toward surface-level similarity (e.g., token or syntactic resemblance) rather than true functional equivalence. To probe this gap they introduced LoCaL (Looks Can Lie), a targeted benchmark of 3,117 code pairs at both method and program granularity, built to include pairs that look similar but differ functionally as well as pairs that look different but behave the same. Instead of relying on expensive hand-written test suites, the authors compute functional similarity via differential fuzzing, automatically generating and executing far more tests than prior work and producing reliable, execution-driven labels without predefined test cases.
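The paper's actual fuzzing pipeline is not reproduced in this summary; the snippet below is a minimal illustrative sketch of the general differential-fuzzing idea, assuming single-argument integer functions. The `functional_similarity` helper and the `ref`/`cand` pair are hypothetical examples, not taken from LoCaL.

```python
import random

def functional_similarity(func_a, func_b, num_trials=1000, seed=0):
    """Estimate functional similarity by differential fuzzing:
    run both implementations on the same random inputs and measure
    how often their observable behavior agrees."""
    rng = random.Random(seed)
    agreements = 0
    for _ in range(num_trials):
        x = rng.randint(-10**6, 10**6)  # random test input (assumes an int-valued API)
        if _observe(func_a, x) == _observe(func_b, x):
            agreements += 1
    return agreements / num_trials

def _observe(func, x):
    """Capture the output or the exception type, so crashes also count as behavior."""
    try:
        return ("ok", func(x))
    except Exception as exc:
        return ("error", type(exc).__name__)

# Example: a surface-similar but functionally different pair.
def ref(n):
    return n * (n + 1) // 2   # sum of 1..n

def cand(n):
    return n * (n - 1) // 2   # off-by-one variant that looks nearly identical

print(functional_similarity(ref, cand))  # low score despite high token overlap
```

A score near 1.0 indicates behavioral agreement across the fuzzed inputs regardless of how similar the two snippets look, which is the kind of execution-driven label the benchmark relies on instead of surface resemblance.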
On LoCaL, all four evaluated CEMs show marked performance drops relative to prior benchmarks, confirming that current reference-based metrics can be misled by superficial code resemblance and overestimate correctness. The work highlights a practical evaluation blind spot in LLM-driven code generation research: cheaper reference-based scoring can fail to capture semantics unless benchmarks explicitly include such deceptive pairs. The authors suggest that exposing CEMs to LoCaL-like data during development or training could yield metrics that are robust to surface bias, improving the fidelity of automated code evaluation for AI-generated programs.