Structured Outputs in LLMs (parthsareen.com)

🤖 AI Summary
The author, an Ollama engineer who worked on Gemma 3, describes launching structured outputs in December 2024 and building Ollama’s sampler in March 2025, along with research on on‑the‑fly structured outputs using finite state machines (FSMs) and current work to support “thinking” models. The piece ties sampling and structured outputs together: after a forward pass the model emits logits, which pass through a chain of transforms (topK → temperature → softmax → topP → minP) before a token is selected, either greedily or by random sampling.

Ollama compiles JSON schemas into grammars and uses them to mask invalid tokens during sampling: each sampled token is checked against the grammar, and if it is invalid, the invalid tokens are masked and the model is re‑sampled (slower, but it guarantees valid output). Practical CPU optimizations include applying topK first, via a bounded heap, so later transforms work over k candidates instead of the full vocabulary; an attempt to fuse temperature scaling and softmax did not yield wins.

This matters because structured outputs let LLMs reliably produce machine‑parseable data (JSON, tables) for scraping, parsing, and integrations, a growing production requirement. FSMs and grammar masking can make constrained generation efficient and robust, but “thinking” models complicate things: techniques like prefilling <think> tags, or constraining only after the internal reasoning phase, can preserve model reasoning while still producing valid output (sketches of all three mechanisms follow below). The author predicts models will improve at native structured generation over time, reducing the need for token‑level masking; for now, grammar‑based sampling remains a practical way to guarantee correctness.
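To make the transform chain concrete, here is a minimal Go sketch of the sampling path described above: topK first via a bounded min‑heap (the heap‑construction optimization the post mentions), then temperature, softmax, topP, minP, and a weighted random draw. Function names and parameter values are illustrative assumptions, not Ollama’s actual implementation.

```go
package main

import (
	"container/heap"
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// token pairs a vocabulary id with its logit (later, its probability).
type token struct {
	id int
	p  float64
}

// minHeap keeps the k largest logits seen so far; the smallest sits on top.
type minHeap []token

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].p < h[j].p }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(token)) }
func (h *minHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// sample applies topK -> temperature -> softmax -> topP -> minP, then
// draws a token from whatever survives. Assumes temp > 0.
func sample(logits []float64, k int, temp, topP, minP float64) int {
	// topK first: one pass over the vocabulary with a bounded heap, so
	// every later transform touches at most k candidates.
	h := make(minHeap, 0, k)
	for id, l := range logits {
		if len(h) < k {
			heap.Push(&h, token{id, l})
		} else if l > h[0].p {
			h[0] = token{id, l}
			heap.Fix(&h, 0)
		}
	}
	cand := []token(h)
	sort.Slice(cand, func(i, j int) bool { return cand[i].p > cand[j].p })

	// Temperature scaling folded into a numerically stable softmax.
	maxL, sum := cand[0].p, 0.0
	for i := range cand {
		cand[i].p = math.Exp((cand[i].p - maxL) / temp)
		sum += cand[i].p
	}
	for i := range cand {
		cand[i].p /= sum
	}

	// topP (nucleus): keep the smallest prefix with cumulative mass >= topP.
	cum, cut := 0.0, len(cand)
	for i, t := range cand {
		cum += t.p
		if cum >= topP {
			cut = i + 1
			break
		}
	}
	cand = cand[:cut]

	// minP: drop anything below minP times the best token's probability.
	floor := minP * cand[0].p
	kept := cand[:0]
	for _, t := range cand {
		if t.p >= floor {
			kept = append(kept, t)
		}
	}

	// Weighted random draw; greedy decoding would just return kept[0].id.
	total := 0.0
	for _, t := range kept {
		total += t.p
	}
	r := rand.Float64() * total
	for _, t := range kept {
		if r -= t.p; r <= 0 {
			return t.id
		}
	}
	return kept[len(kept)-1].id
}

func main() {
	logits := []float64{2.0, 1.0, 0.5, -1.0, 3.0}
	fmt.Println("sampled:", sample(logits, 3, 0.8, 0.9, 0.05))
}
```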
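The mask‑and‑resample path described in the summary can then wrap that function. The `Grammar` interface below is a hypothetical stand‑in for a JSON‑schema‑compiled grammar/FSM: the fast path accepts the unconstrained sample if the grammar allows it, and only on a rejection pays for masking every invalid token with `-Inf` and sampling again.

```go
// Grammar is a hypothetical stand-in for a JSON-schema-compiled FSM:
// Valid reports whether token id may appear in the current state, and
// Advance moves the state forward after a token is emitted.
type Grammar interface {
	Valid(id int) bool
	Advance(id int)
}

// constrainedSample tries the cheap path first: sample unconstrained and
// keep the token if the grammar accepts it. Only on a rejection does it
// mask invalid tokens and re-sample, which is slower but guarantees the
// output stays inside the grammar.
func constrainedSample(logits []float64, g Grammar, k int, temp, topP, minP float64) int {
	if id := sample(logits, k, temp, topP, minP); g.Valid(id) {
		g.Advance(id)
		return id
	}
	// Slow path: -Inf logits become zero probability after softmax, so
	// invalid tokens can never be drawn. Assumes the grammar always
	// leaves at least one token valid.
	masked := make([]float64, len(logits))
	for i, l := range logits {
		if g.Valid(i) {
			masked[i] = l
		} else {
			masked[i] = math.Inf(-1)
		}
	}
	id := sample(masked, k, temp, topP, minP)
	g.Advance(id)
	return id
}
```

The fast path matters because a well‑trained model usually emits grammar‑valid tokens anyway, so the full vocabulary scan is only paid on the rare miss.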
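For thinking models, the "constrain only after the internal reasoning phase" approach might look like the decode loop below; the `Model` interface and the tag detection are assumptions for illustration, with `<think>...</think>` standing in for a model‑specific reasoning delimiter (this continues the same file and additionally needs the `strings` import).

```go
// Model is a hypothetical interface over a forward pass and tokenizer.
type Model interface {
	Forward(text string) []float64 // logits for the next token
	Decode(id int) string          // token id -> text
	IsEOS(id int) bool
}

// decode lets the model reason freely inside <think>...</think> and only
// starts enforcing the grammar once the closing tag has been emitted, so
// constrained output does not clobber the reasoning phase.
func decode(m Model, g Grammar, prompt string) string {
	var out strings.Builder
	thinking := true // assume the model opens with a <think> block
	for {
		logits := m.Forward(prompt + out.String())
		var id int
		if thinking {
			id = sample(logits, 40, 0.8, 0.9, 0.05) // unconstrained
		} else {
			id = constrainedSample(logits, g, 40, 0.8, 0.9, 0.05)
		}
		out.WriteString(m.Decode(id))
		if thinking && strings.Contains(out.String(), "</think>") {
			thinking = false // reasoning done; constrain from here on
		}
		if m.IsEOS(id) {
			return out.String()
		}
	}
}
```

Prefilling the `<think>` tag, the other technique the post mentions, is complementary: seeding the tag in the prompt ensures the reasoning phase happens before any constrained tokens are produced.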