🤖 AI Summary
A formal, byte-level analysis compares TOON (a compact, indented data format) against compact JSON to quantify structural overhead and its implications for LLM workloads (byte count serves as a first-order proxy for tokenizer tokens and inference cost). Under controlled assumptions (compact JSON; canonical TOON with 2-space indent and one space after ":"; ASCII/simple keys that can be unquoted in TOON; shallow-to-moderate nesting), the paper derives closed-form byte-length functions and shows that TOON reduces structural bytes in most common patterns. That reduction implies lower token counts and potentially cheaper LLM inference when data is dominated by repetitive keys or flat/tabular structures.
Key technical takeaways: TOON's biggest wins are tabular arrays (declare column names once and stream rows), where savings scale linearly with both row count and field count (example: ~12 MB saved for 1M rows × 2 fields). Flat objects, primitive arrays, and root arrays also show consistent per-field/per-element byte reductions. Downsides: TOON pays per-element overhead for arrays-of-arrays (~6 extra bytes per inner array), so JSON is smaller there; each nesting level adds 2 bytes of indentation per line, so deep nesting can flip the advantage to JSON. Strings that resemble literals must be quoted in both formats, reducing TOON's margin. The model is intentionally simplified (real datasets, exponent notation, tokenizer quirks, and benchmark conditions may narrow or reverse the gains), so empirical benchmarks are recommended to validate practical impact.
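The tabular-array saving described above can be sketched in a few lines of Python. This is a rough illustration, not the official TOON serializer: the header syntax `name[count]{fields}:` followed by comma-separated rows is an assumption based on the summary's description of declaring column names once and streaming rows.

```python
import json

# Sample of uniform rows (the summary's example uses 1M rows x 2 fields;
# 1,000 rows keeps the sketch fast while the per-row saving is the same).
rows = [{"id": i, "name": f"user{i}"} for i in range(1000)]

# Compact JSON: both key names repeat on every single row.
compact_json = json.dumps(rows, separators=(",", ":"))

# TOON-style tabular sketch (assumed syntax): declare the field names
# once in a header, then emit one 2-space-indented row of values per line.
header = f"rows[{len(rows)}]{{id,name}}:"
toon = "\n".join([header] + [f"  {r['id']},{r['name']}" for r in rows])

print(len(compact_json.encode()), len(toon.encode()))
```

Because the JSON version repeats `"id":` and `"name":` plus braces for every row while the TOON-style version pays only two indent bytes and a newline, the byte gap grows linearly with row count, matching the summary's claim.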