🤖 AI Summary
Researchers ran a controlled experiment to find which table formats LLMs parse most reliably: they fed GPT-4.1-nano 1,000 synthetic employee records (8 attributes each) and asked 1,000 randomized field-retrieval questions across 11 formats (JSON, CSV, XML, YAML, HTML, markdown table, a "markdown-KV" key:value style, INI, pipe-delimited, JSONL, and natural language). Accuracy varied widely: markdown-KV led at 60.7% (52,104 tokens), CSV and JSONL performed poorly (44.3% and 45.0%; CSV used 19,524 tokens), and formats like XML, INI, and YAML sat in the mid-50s. The best format was ~16 percentage points more accurate than CSV but required ~2.7× more tokens, exposing a trade-off between accuracy and token cost.
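For concreteness, here is a minimal Python sketch of the same record rendered at the two extremes the experiment compared; the field names and values are invented, since the article doesn't reproduce the exact schema:

```python
# Hypothetical employee record; the study's real schema is not shown here.
record = {
    "id": 17,
    "name": "Alice Park",
    "department": "Engineering",
    "role": "Backend Developer",
    "salary": 98000,
    "location": "Austin",
    "hire_date": "2021-03-15",
    "manager": "D. Chen",
}

# CSV: one shared header row, so each record is just a compact value list.
csv_row = ",".join(str(v) for v in record.values())

# Markdown-KV: every field name repeated beside its value, which is
# roughly where the ~2.7x token overhead over CSV comes from.
markdown_kv = "\n".join(f"- {key}: {value}" for key, value in record.items())

print(csv_row)
print(markdown_kv)
```

The repetition is redundant to a human reader, but it appears to give the model an explicit label next to each value, which is a plausible mechanism behind the accuracy gap.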
This matters for RAG pipelines and data-centric AI systems, where format choice affects extraction accuracy, latency, and API cost. Practical takeaways: try simple format transformations (markdown-KV when accuracy is critical; markdown tables to balance readability and cost), as sketched below; chunk large tables and repeat headers in each chunk; and don't assume CSV/JSONL are optimal. Caveats: results cover only GPT-4.1-nano, one synthetic tabular pattern, and single-field retrieval tasks; different models, nested data, smaller chunks, or other question types may change the rankings.
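A minimal sketch of the two transformation takeaways, assuming plain CSV input; `rows_to_markdown_kv` and `chunk_csv_with_headers` are hypothetical helper names, not code from the study:

```python
import csv
import io

def rows_to_markdown_kv(csv_text: str) -> str:
    """Re-render CSV rows as markdown key:value blocks, trading
    extra tokens for an explicit field name next to each value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for i, row in enumerate(reader, start=1):
        lines = [f"## Record {i}"] + [f"- {k}: {v}" for k, v in row.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

def chunk_csv_with_headers(csv_text: str, rows_per_chunk: int = 100) -> list[str]:
    """Split a large CSV into chunks, repeating the header row in each
    chunk so every chunk stays self-describing."""
    header, *rows = csv_text.strip().splitlines()
    return [
        "\n".join([header] + rows[i : i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]

# Toy usage:
csv_text = "id,name,department\n1,Alice,Engineering\n2,Bob,Sales"
print(rows_to_markdown_kv(csv_text))
print(chunk_csv_with_headers(csv_text, rows_per_chunk=1))
```

Either transformation is cheap to apply at retrieval time, so the sensible move is to benchmark a couple of formats on your own model and data before committing to one.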