🤖 AI Summary
Researchers ran a controlled experiment to find which table formats LLMs parse most reliably: they fed GPT-4.1-nano 1,000 synthetic employee records (8 attributes each) and asked 1,000 randomized field-retrieval questions across 11 formats (JSON, CSV, XML, YAML, HTML, markdown table, a "markdown-KV" key:value style, INI, pipe-delimited, JSONL, and natural language). Accuracy varied widely: markdown-KV led at 60.7% (52,104 tokens), CSV and JSONL performed poorly (44.3% and 45.0%; CSV used 19,524 tokens), and formats like XML, INI, and YAML sat in the mid-50s. The best format was ~16 percentage points more accurate than CSV but required ~2.7× more tokens, exposing a trade-off between accuracy and token cost.
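For concreteness, here is a minimal Python sketch of the same record rendered at the two extremes the experiment compared; the field names and values are invented, since the article doesn't reproduce the exact schema:

```python
# Hypothetical employee record; the study's real schema is not shown here.
record = {
    "id": 17,
    "name": "Alice Park",
    "department": "Engineering",
    "role": "Backend Developer",
    "salary": 98000,
    "location": "Austin",
    "hire_date": "2021-03-15",
    "manager": "D. Chen",
}

# CSV: one shared header row, so each record is just a compact value list.
csv_row = ",".join(str(v) for v in record.values())

# Markdown-KV: every field name repeated beside its value, which is
# roughly where the ~2.7x token overhead over CSV comes from.
markdown_kv = "\n".join(f"- {key}: {value}" for key, value in record.items())

print(csv_row)
print(markdown_kv)
```

The repetition is redundant to a human reader, but it appears to give the model an explicit label next to each value, which is a plausible mechanism behind the accuracy gap.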
This matters for RAG pipelines and data-centric AI systems, where format choice affects extraction accuracy, latency, and API cost. Practical takeaways: try simple format transformations (markdown-KV when accuracy is critical; markdown tables to balance readability and cost), as sketched below; chunk large tables and repeat headers in each chunk; and don't assume CSV/JSONL are optimal. Caveats: results cover only GPT-4.1-nano, one synthetic tabular pattern, and single-field retrieval tasks; different models, nested data, smaller chunks, or other question types may change the rankings.
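A minimal sketch of the two transformation takeaways, assuming plain CSV input; `rows_to_markdown_kv` and `chunk_csv_with_headers` are hypothetical helper names, not code from the study:

```python
import csv
import io

def rows_to_markdown_kv(csv_text: str) -> str:
    """Re-render CSV rows as markdown key:value blocks, trading
    extra tokens for an explicit field name next to each value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for i, row in enumerate(reader, start=1):
        lines = [f"## Record {i}"] + [f"- {k}: {v}" for k, v in row.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

def chunk_csv_with_headers(csv_text: str, rows_per_chunk: int = 100) -> list[str]:
    """Split a large CSV into chunks, repeating the header row in each
    chunk so every chunk stays self-describing."""
    header, *rows = csv_text.strip().splitlines()
    return [
        "\n".join([header] + rows[i : i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]

# Toy usage:
csv_text = "id,name,department\n1,Alice,Engineering\n2,Bob,Sales"
print(rows_to_markdown_kv(csv_text))
print(chunk_csv_with_headers(csv_text, rows_per_chunk=1))
```

Either transformation is cheap to apply at retrieval time, so the sensible move is to benchmark a couple of formats on your own model and data before committing to one.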