🤖 AI Summary
JetBrains describes "token healing," an inference-time fix that makes LLM-powered code completion respect partially typed words without expensive retraining or character-level models. The problem: when a developer has typed only a prefix of a word (e.g., "Nod" for "Node"), the tokenizer splits that prefix into sub-tokens that differ from how the full word would be tokenized, so the model is conditioned on token sequences it rarely saw during training and produces wrong completions. Training-time remedies (fine-tuning, subword regularization) are incomplete, slow, or hurt other tasks, and character-level modeling is too slow and too context-limited.
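A quick way to see the mismatch concretely (a sketch, not from the article, using OpenAI's tiktoken library as a stand-in since the summary doesn't name the tokenizer JetBrains uses):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # any BPE vocabulary shows the effect

full = enc.encode("Node")   # token ids for the finished identifier
typed = enc.encode("Nod")   # token ids for the half-typed prefix

print(full, [enc.decode([t]) for t in full])
print(typed, [enc.decode([t]) for t in typed])
# The two sequences generally differ, so the model ends up conditioned
# on sub-token boundaries it rarely saw mid-word during training.
```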
Their practical solution constrains sampling at inference time. Single-token healing forces the next sampled token to start with a given suffix; multi-token healing generalizes this by allowing only tokens that either start with the user's prefix or are themselves prefixes of it (token.startswith(prefix) or prefix.startswith(token)). Disallowed tokens get their logits set to -inf, which preserves the relative probabilities among allowed tokens and keeps the scheme compatible with TensorRT-LLM and speculative decoding.

The initial implementation was slow because it linearly scanned the ~150k-token vocabulary on every query; they fixed this with a character trie (lookup cost proportional to the query length) implemented in pure Python (~30 MB), plus a cache for short prefixes that match many tokens. Typical queries now take under 300 µs, with a worst case under 1 ms, restoring sub-100 ms autocomplete while reliably completing partial identifiers like "Nod" → "Node."
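A minimal sketch of that mechanism; TokenTrie and heal_logits are hypothetical names, since the summary doesn't show JetBrains' actual code, and a production system would mask a logits tensor in place rather than copy a Python list:

```python
import math

class TokenTrie:
    """Character trie over the tokenizer vocabulary (hypothetical helper).

    allowed_ids(prefix) returns the ids of every token t satisfying the
    multi-token-healing condition described above:
        t.startswith(prefix) or prefix.startswith(t)
    without scanning all ~150k vocabulary entries.
    """

    def __init__(self, vocab):  # vocab: list of token strings, index = token id
        self.root = {}
        for token_id, token in enumerate(vocab):
            node = self.root
            for ch in token:
                node = node.setdefault(ch, {})
            node.setdefault(None, []).append(token_id)  # token ends at this node

    def allowed_ids(self, prefix):
        ids, node = [], self.root
        # Walk down the prefix; tokens ending along the way are proper
        # prefixes of the query (prefix.startswith(token)).
        for ch in prefix:
            ids.extend(node.get(None, ()))
            node = node.get(ch)
            if node is None:  # no vocabulary token extends this far
                return ids
        # Every token in this subtree starts with the full prefix
        # (token.startswith(prefix)), including the prefix itself.
        stack = [node]
        while stack:
            n = stack.pop()
            ids.extend(n.get(None, ()))
            stack.extend(child for key, child in n.items() if key is not None)
        return ids


def heal_logits(logits, trie, prefix):
    """Mask disallowed tokens to -inf; allowed tokens keep their original
    logits, so their relative probabilities after softmax are unchanged."""
    allowed = set(trie.allowed_ids(prefix))
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]
```

A toy run with a made-up vocabulary:

```python
vocab = ["N", "No", "Nod", "Node", "Node.js", "Not", "Hello"]
trie = TokenTrie(vocab)
print(sorted(trie.allowed_ids("Nod")))   # [0, 1, 2, 3, 4]: N, No, Nod, Node, Node.js
print(heal_logits([0.5] * len(vocab), trie, "Nod"))  # "Not", "Hello" -> -inf
```

For very short prefixes, allowed_ids can return a large slice of the vocabulary; caching those results per prefix (the article's remaining optimization) is what bounds the worst case.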