🤖 AI Summary
ImagenWorld is a new, community-driven benchmark that stress-tests modern image generation and editing systems with explainable human evaluation. It comprises 3.6K condition sets spanning six tasks (text-to-image generation, single- and multi-reference generation, and the corresponding three editing variants) and six domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The dataset is backed by 20K fine-grained human annotations and an evaluation schema that extracts objects and segments from model outputs, so annotators can tag localized, object- and segment-level failures rather than assigning only opaque scalar scores.
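To make that annotation granularity concrete, here is a minimal sketch of what one such record could look like; the class and field names are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical record shape for ImagenWorld-style fine-grained annotations.
# All names here are assumptions for illustration, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class SegmentFailure:
    segment_id: int            # index of the extracted object/segment
    label: str                 # e.g. "y-axis labels", "storefront sign"
    failure_type: str          # e.g. "text_garbled", "object_missing"
    comment: str = ""          # free-form annotator explanation

@dataclass
class Annotation:
    task: str                  # one of the six tasks, e.g. "text-to-image"
    domain: str                # one of the six domains, e.g. "screenshots"
    model: str                 # system under evaluation
    condition_id: str          # which of the 3.6K condition sets
    overall_score: float       # a scalar rating can still be recorded
    failures: list[SegmentFailure] = field(default_factory=list)

# Localized failures accumulate per output instead of one opaque score:
ann = Annotation(task="text-guided editing", domain="information graphics",
                 model="example-model", condition_id="ig-0042",
                 overall_score=3.0)
ann.failures.append(SegmentFailure(2, "y-axis labels", "text_garbled",
                                   "digits rendered as unreadable glyphs"))
```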
Large-scale evaluation of 14 models surfaces clear technical takeaways: editing tasks, especially local edits, are harder than generation, and models struggle in symbolic or text-heavy domains (screenshots, information and textual graphics) where numerical, plot, and chart fidelity and readable text are critical. Closed-source systems still lead overall, though targeted data curation (e.g., Qwen-Image) narrows the gap on text-heavy tasks. Vision-language-model (VLM) automated metrics can approximate human rankings (Kendall τ up to ~0.79) but do not provide the fine-grained, explainable error attribution that ImagenWorld records. The benchmark therefore functions both as a rigorous evaluation and as a diagnostic tool to guide model improvements in instruction fidelity, local compositing, text rendering, and semantic/label consistency.
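As a rough illustration of how that ranking agreement is measured, the snippet below computes Kendall's τ between hypothetical human and VLM-judge scores with `scipy.stats.kendalltau`; all numbers are made up:

```python
# Minimal sketch: compare a VLM judge's model ranking against human ratings
# using Kendall's tau. Scores are fabricated for illustration only.
from scipy.stats import kendalltau

models = ["model_a", "model_b", "model_c", "model_d", "model_e"]
human_scores = [4.1, 3.6, 3.9, 2.8, 3.2]   # hypothetical mean human ratings
vlm_scores   = [4.0, 3.4, 3.7, 3.0, 2.9]   # hypothetical VLM-judge ratings

tau, p_value = kendalltau(human_scores, vlm_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
# A tau near the reported ~0.79 means the VLM preserves most of the human
# ordering, but this scalar agreement says nothing about *which* objects or
# segments failed -- the error attribution ImagenWorld's annotations supply.
```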