Using UUIDs in prompts is bad (boundaryml.com)

🤖 AI Summary
LLMs struggle with high-entropy identifiers: a single UUID costs ~24 tokens in common tokenizers, while a small integer can be a single token. That bloats prompts and raises error rates whenever the model must read or reproduce IDs. The article demonstrates a simple fix: remap UUIDs to small integer IDs before calling the model, then translate the integers back in your application.

Implementation steps:
1. Collect the UUIDs referenced in the prompt.
2. Deduplicate them and assign each a small integer ID.
3. Replace the UUIDs in the prompt with their integers.
4. Call the LLM.
5. Map the integers in the response back to UUIDs.

This keeps prompts compact (it works best with fewer than 1,000 unique IDs, so each integer fits in a single token) while preserving application-level UUIDs. In experiments on a 200-item aggregation task with Claude Haiku, raw UUIDs produced ~48.5 errors on average, versus ~6 with native integers and ~5.5 with UUID→int remapping. Model choice still matters: Opus 4 reached 100% accuracy with integers but only 80% with UUIDs, so benchmark per model and task. Use remapping when the task requires the model to reproduce IDs accurately and you are seeing ID-related errors; with more than 1,000 unique IDs, or in read-only ID use cases, consider batching or other strategies. The author also proposes adding automatic UUID→int remapping to BAML to make this pattern seamless.
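Below is a minimal Python sketch of that remap/restore pipeline, not the article's actual implementation. It assumes standard 8-4-4-4-12 hex UUIDs, and the decode step assumes bare integers in the model output refer only to remapped IDs (real numeric values in the response would need a stricter format, such as asking the model to emit `id:<n>`). The `call_llm` call shown in the usage example is a hypothetical stand-in for whatever model client you use.

```python
import re

# Matches standard 8-4-4-4-12 hex UUIDs.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)

def remap_uuids(prompt: str) -> tuple[str, dict[str, str]]:
    """Replace each unique UUID with a small integer; return the
    compact prompt and an int->UUID map for decoding the response."""
    uuid_to_int: dict[str, int] = {}

    def assign(match: re.Match) -> str:
        u = match.group(0).lower()
        if u not in uuid_to_int:            # deduplicate: one int per UUID
            uuid_to_int[u] = len(uuid_to_int)
        return str(uuid_to_int[u])

    compact = UUID_RE.sub(assign, prompt)
    int_to_uuid = {str(i): u for u, i in uuid_to_int.items()}
    return compact, int_to_uuid

def restore_uuids(text: str, int_to_uuid: dict[str, str]) -> str:
    """Map integer IDs in the model output back to the original UUIDs."""
    # \b guards keep "1" from matching inside "12"; unmapped numbers pass through.
    return re.sub(
        r"\b\d+\b",
        lambda m: int_to_uuid.get(m.group(0), m.group(0)),
        text,
    )

if __name__ == "__main__":
    prompt = (
        "Compare item 3b9f8a2e-1c4d-4e5f-9a6b-7c8d9e0f1a2b with item "
        "e1d2c3b4-a596-4877-b869-0a1b2c3d4e5f and report the winner."
    )
    compact, mapping = remap_uuids(prompt)
    print(compact)  # both UUIDs replaced by 0 and 1

    # response = call_llm(compact)   # hypothetical: your model call goes here
    fake_response = "Item 0 beats item 1."
    print(restore_uuids(fake_response, mapping))  # UUIDs restored
```

The fragile part is decoding: the mapping must be kept alongside the request, and the response format should make the integer IDs unambiguous before you translate them back.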