🤖 AI Summary
A new benchmark called LitBench has been introduced to improve the evaluation of creative writing generated by large language models (LLMs). Because open-ended narratives lack definitive ground truths, evaluating them is uniquely challenging; LitBench addresses this with a structured dataset designed for reliable assessment. The benchmark comprises a training corpus of 43,827 human-labeled preference pairs and a held-out test set of 2,480 human-labeled story comparisons, enabling researchers to rigorously evaluate both zero-shot LLM judges and newly trained reward models.
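As a rough illustration of how this kind of pairwise evaluation works (this is not the paper's actual harness; the data fields and judge function below are hypothetical), each benchmark entry can be reduced to a prompt, two candidate stories, and a human preference label, and a judge or reward model is scored by how often its preferred story matches the human label:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Comparison:
    prompt: str
    story_a: str
    story_b: str
    human_prefers_a: bool  # human preference label

def pairwise_agreement(
    comparisons: List[Comparison],
    score: Callable[[str, str], float],  # higher score = better story for this prompt
) -> float:
    """Fraction of comparisons where the model's preferred story matches the human label."""
    hits = 0
    for c in comparisons:
        model_prefers_a = score(c.prompt, c.story_a) > score(c.prompt, c.story_b)
        hits += int(model_prefers_a == c.human_prefers_a)
    return hits / len(comparisons)

# Toy usage: a trivial "judge" that simply prefers longer stories.
if __name__ == "__main__":
    data = [
        Comparison("Write about a lighthouse.", "Short tale.", "A much longer, detailed tale...", False),
        Comparison("Write about a storm.", "An evocative, well-paced story about the storm.", "Rain.", True),
    ]
    print(pairwise_agreement(data, lambda prompt, story: float(len(story))))
```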
This benchmark is significant for the AI/ML community because it addresses a critical gap in the assessment of creative AI outputs. Reported results indicate that the Claude-3.7-Sonnet model achieved 73% agreement with human preferences as a zero-shot judge, while trained Bradley-Terry and generative reward models surpassed this with 78% accuracy. An online human study further shows that these trained reward models consistently track human preferences on newly generated LLM stories, a substantial step toward automating the evaluation and refinement of creative writing systems. The release of LitBench and its reward models provides a valuable resource for researchers aiming to improve the quality of AI-generated narratives.
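For context on the Bradley-Terry approach mentioned above, such a reward model is typically trained so the human-preferred story receives a higher scalar score, using a pairwise logistic loss. The sketch below is a minimal illustration under assumed inputs (precomputed story embeddings and a toy linear reward head), not the authors' actual architecture or training setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearRewardModel(nn.Module):
    """Toy reward head: maps a story embedding to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the chosen story wins under the Bradley-Terry model:
    # P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, batch = 16, 8
    model = LinearRewardModel(dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    # Stand-in embeddings for human-preferred and rejected stories.
    chosen_emb, rejected_emb = torch.randn(batch, dim), torch.randn(batch, dim)
    for _ in range(100):
        loss = bradley_terry_loss(model(chosen_emb), model(rejected_emb))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final pairwise loss: {loss.item():.3f}")
```

At inference time, the trained reward head scores stories individually, and agreement with human labels can be measured with the same pairwise comparison shown earlier.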