The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer (arxiv.org)

0 points 1 day ago

🤖 AI Summary

A recent paper, "The Benchmark Illusion," highlights a problem in how pruned large language models (LLMs) are evaluated. These compressed models can still score well on multiple-choice benchmarks, yet they often fail on open-ended generation. The study examines how high-sparsity pruning affects model behavior and reports that pruned LLMs can misinterpret the task at hand. Specifically, under greedy decoding in open generation, the models frequently fail to output the correct answer even though they can still recognize it in a multiple-choice format. The researchers found that under high-sparsity pruning methods such as Wanda, the correct answers are not erased but demoted within the generation process, so a decoder that always emits its top-ranked output no longer surfaces them. For the AI/ML community, this means current benchmarking practices may overestimate how well pruned models work. The study argues for evaluations that measure a model's ability to produce answers rather than merely recognize them, to avoid misleading assessments of capability in real-world applications.
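The gap between recognizing and producing an answer can be made concrete with a toy sketch. Everything below is hypothetical: the candidate strings, the scores, and the helper names (`multiple_choice_pick`, `greedy_pick`, `rank_of`) are invented for illustration and are not taken from the paper or from any library. The sketch only shows how an answer that has been demoted, but not erased, can still win a comparison restricted to the listed options while losing a greedy pick over all candidates.

```python
# Toy illustration only: all scores are invented, not results from the paper.

def multiple_choice_pick(scores, options):
    """Pick the best-scoring candidate among the listed options only."""
    return max(options, key=lambda option: scores[option])


def greedy_pick(scores):
    """Pick the single best-scoring candidate over every possible output."""
    return max(scores, key=scores.get)


def rank_of(scores, candidate):
    """1-based rank of a candidate when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(candidate) + 1


# Hypothetical first-step scores for "What is the capital of France?"
options = ["Paris", "London", "Berlin", "Rome"]
dense_scores = {
    "Paris": 5.0, "London": 2.0, "Berlin": 1.5, "Rome": 1.0,
    "the": 3.0, "A": 2.5,
}
# Assumed effect of pruning: "Paris" drops below generic continuations
# but still outranks the other answer options.
pruned_scores = {
    "Paris": 3.0, "London": 2.0, "Berlin": 1.5, "Rome": 1.0,
    "the": 4.0, "A": 3.5,
}

for name, scores in [("dense", dense_scores), ("pruned", pruned_scores)]:
    print(
        name,
        "| multiple choice:", multiple_choice_pick(scores, options),
        "| greedy:", greedy_pick(scores),
        "| rank of 'Paris':", rank_of(scores, "Paris"),
    )
```

In this toy setup the pruned scores still select "Paris" when the comparison is limited to the four options, but the greedy pick over all candidates lands on a generic token instead. A benchmark that only scores the listed options would not see the difference.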
