Toon vs. JSON: Byte-Level Efficiency Model (toonformat.dev)

🤖 AI Summary
A formal, byte-level analysis compares TOON (a compact, indented data format) against compact JSON to quantify structural overhead and its implications for LLM workloads, treating byte count as a first-order proxy for tokenizer tokens and inference cost. Under controlled assumptions (compact JSON, canonical TOON with 2-space indent and one space after ":", simple ASCII keys that can be left unquoted in TOON, and shallow-to-moderate nesting), the paper derives closed-form byte-length functions and shows that TOON reduces structural bytes in most common patterns. That reduction implies lower token counts and potentially cheaper LLM inference when data is dominated by repetitive keys or flat/tabular structures.

Key technical takeaways: TOON's biggest wins come from tabular arrays, where column names are declared once and rows are streamed beneath them, so savings scale linearly with both row count and field count (example: ~12 MB saved for 1M rows × 2 fields). Flat objects, primitive arrays, and root arrays also show consistent per-field and per-element byte reductions.

Downsides: TOON pays per-element overhead for arrays-of-arrays (~6 extra bytes per inner array), so JSON is smaller there; each nesting level adds 2 bytes of indentation per line, so deep nesting can flip the advantage back to JSON. Strings that resemble literals must be quoted in both formats, which narrows TOON's margin. The model is intentionally simplified: real datasets, exponent notation, and tokenizer quirks may narrow or reverse the gains, so empirical benchmarks are recommended to validate the practical impact.
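The tabular-array saving can be sketched with a quick byte count. The snippet below is illustrative, not the official TOON encoder: it approximates TOON's tabular form (a `name[count]{fields}:` header followed by comma-separated value rows, a syntax assumed from the format's published examples) and compares its encoded size against compact JSON for the same records.

```python
import json

# Hypothetical dataset: 1,000 uniform records, the pattern where
# TOON's tabular arrays are claimed to win.
rows = [{"id": i, "name": f"user{i}"} for i in range(1000)]

# Compact JSON: keys "id" and "name" are repeated in every row.
json_bytes = len(json.dumps({"users": rows}, separators=(",", ":")).encode())

# Approximate TOON tabular encoding: declare the columns once in a
# header line, then stream one comma-separated row per record.
fields = list(rows[0])
header = f"users[{len(rows)}]{{{','.join(fields)}}}:"
body = "\n".join("  " + ",".join(str(r[f]) for f in fields) for r in rows)
toon_bytes = len((header + "\n" + body).encode())

print(json_bytes, toon_bytes)
```

Because each JSON row repeats the quoted keys and braces while each sketched TOON row carries only values, the per-row saving is roughly constant, so total savings grow linearly with row count, matching the article's scaling claim.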