Linear-time tokenizer crate by GitHub (github.blog)

🤖 AI Summary
GitHub announced an open-source Rust implementation of a byte-pair encoding (BPE) tokenizer that runs in linear time and supports incremental operation, published as the MIT-licensed crates bpe and bpe-openai, to address scaling and robustness needs for GitHub Copilot and large-scale retrieval-augmented generation (RAG). Traditional BPE implementations take quadratic or O(n log n) time and require the full input up front, which makes dynamic tasks expensive or unsafe at web scale: chunking large code files, building prompts against a token budget, incremental token counting, and worst-case safety (a denial-of-service risk). GitHub's tokenizer lets you append or prepend text, count tokens for slices, and abort as soon as a limit is exceeded, with constant-time access to the current token count and support for snapshotting.

The core technical insight is a "compatibility" property that permits left-to-right dynamic programming (sketched in the toy examples below): store the last token of the encoding of every prefix, and determine the valid last token at each position with pairwise compatibility checks. The implementation uses an Aho-Corasick automaton to enumerate the dictionary tokens ending at each position and retokenizes token pairs on the fly; both the number of overlapping matches and the retokenization work are bounded by constants derived from the dictionary, yielding true O(n) runtime.

In single-threaded benchmarks on the o200k_base model, GitHub's crate outperforms tiktoken-rs by roughly 4× and Hugging Face's tokenizers by roughly 10× on practical inputs, while avoiding the quadratic worst-case behavior of competing tools. This improves throughput, latency, cost, and safety for large-scale tokenization, embedding indexing, and RAG pipelines.
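To make the compatibility idea concrete, here is a minimal Rust sketch of the left-to-right dynamic program, not the crates' actual code: `last[i]` stores the last token of the BPE encoding of `input[..i]`, and a candidate suffix token is accepted when retokenizing it together with the previous prefix's last token leaves the pair unchanged. The naive `bpe_encode` reference encoder and the longest-first scan over suffix lengths are illustrative assumptions standing in for what the real implementation does with pair-bounded retokenization and an Aho-Corasick automaton:

```rust
use std::collections::HashMap;

/// Token ranks: lower rank merges earlier. Real BPE dictionaries contain
/// every single byte, so a last token always exists for each prefix.
type Ranks = HashMap<Vec<u8>, u32>;

/// Naive reference encoder: repeatedly merge the adjacent pair whose
/// concatenation has the lowest rank. Quadratic in general, but below it
/// is only ever applied to two-token concatenations, mirroring the bounded
/// "retokenization of token pairs" described in the summary.
fn bpe_encode(bytes: &[u8], ranks: &Ranks) -> Vec<Vec<u8>> {
    let mut parts: Vec<Vec<u8>> = bytes.iter().map(|&b| vec![b]).collect();
    loop {
        // Find the adjacent pair whose merged form has the lowest rank.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let cand = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
            if let Some(&r) = ranks.get(cand.as_slice()) {
                if best.map_or(true, |(_, br)| r < br) {
                    best = Some((i, r));
                }
            }
        }
        match best {
            Some((i, _)) => {
                let merged = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => return parts,
        }
    }
}

/// Left-to-right DP: last[i] is the last token of the encoding of
/// input[..i]. Candidates are dictionary tokens that are suffixes of the
/// current prefix (the real crate enumerates these with an Aho-Corasick
/// automaton); a candidate passes iff retokenizing it with the last token
/// of the remaining prefix leaves both tokens unchanged.
fn last_tokens(input: &[u8], ranks: &Ranks) -> Vec<Vec<u8>> {
    let max_len = ranks.keys().map(|t| t.len()).max().unwrap_or(0);
    let mut last: Vec<Vec<u8>> = vec![Vec::new(); input.len() + 1];
    for i in 1..=input.len() {
        // Longest-first scan; the crate's analysis pins down which candidate
        // is the true last token, while this toy takes the first compatible.
        for n in (1..=i.min(max_len)).rev() {
            let tok = &input[i - n..i];
            if !ranks.contains_key(tok) {
                continue;
            }
            let compatible = if n == i {
                true // token covers the whole prefix; well-formed dictionaries encode a token to itself
            } else {
                let prev = &last[i - n];
                bpe_encode(&[prev.as_slice(), tok].concat(), ranks)
                    == [prev.clone(), tok.to_vec()]
            };
            if compatible {
                last[i] = tok.to_vec();
                break;
            }
        }
    }
    last
}

fn main() {
    // Tiny illustrative dictionary; ranks are merge priorities.
    let ranks: Ranks = [
        (b"a".to_vec(), 0u32),
        (b"b".to_vec(), 1),
        (b"c".to_vec(), 2),
        (b"ab".to_vec(), 3),
        (b"abc".to_vec(), 4), // formed from "ab" + "c"
    ]
    .into_iter()
    .collect();
    for (i, tok) in last_tokens(b"cabcab", &ranks).iter().enumerate().skip(1) {
        println!("prefix of length {} ends with {:?}", i, String::from_utf8_lossy(tok));
    }
}
```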
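The incremental features fall out of the same per-prefix state. Below is a sketch of an appendable counter with O(1) token counts, abort-on-budget, and snapshot/rollback; it reuses `Ranks` and `bpe_encode` from the sketch above, and the struct and method names are illustrative rather than the published crates' API:

```rust
/// Incremental token counting on top of the per-prefix DP state.
struct AppendableCounter<'a> {
    ranks: &'a Ranks,
    max_len: usize,
    input: Vec<u8>,
    last: Vec<Vec<u8>>, // last[i]: last token of the encoding of input[..i]
    count: Vec<usize>,  // count[i]: token count of the encoding of input[..i]
}

impl<'a> AppendableCounter<'a> {
    fn new(ranks: &'a Ranks) -> Self {
        let max_len = ranks.keys().map(|t| t.len()).max().unwrap_or(0);
        Self { ranks, max_len, input: Vec::new(), last: vec![Vec::new()], count: vec![0] }
    }

    /// Constant-time access to the current token count.
    fn token_count(&self) -> usize {
        *self.count.last().unwrap()
    }

    /// Append one byte, extending the DP state by one prefix.
    fn push(&mut self, byte: u8) {
        self.input.push(byte);
        let i = self.input.len();
        for n in (1..=i.min(self.max_len)).rev() {
            let tok = &self.input[i - n..i];
            if !self.ranks.contains_key(tok) {
                continue;
            }
            let compatible = if n == i {
                true
            } else {
                let prev = &self.last[i - n];
                bpe_encode(&[prev.as_slice(), tok].concat(), self.ranks)
                    == [prev.clone(), tok.to_vec()]
            };
            if compatible {
                // The count follows the DP back-pointer: one more token than
                // the prefix that the new last token extends.
                self.count.push(self.count[i - n] + 1);
                self.last.push(tok.to_vec());
                return;
            }
        }
        unreachable!("real BPE dictionaries contain every single byte");
    }

    /// Append `text` unless that would exceed `budget` tokens; on overflow,
    /// roll back to the state at entry and report failure.
    fn try_append(&mut self, text: &[u8], budget: usize) -> bool {
        let snapshot = self.input.len(); // per-prefix state makes snapshots trivial
        for &b in text {
            self.push(b);
            if self.token_count() > budget {
                self.truncate(snapshot); // abort: restore the snapshot
                return false;
            }
        }
        true
    }

    /// Roll back to an earlier prefix length.
    fn truncate(&mut self, len: usize) {
        self.input.truncate(len);
        self.last.truncate(len + 1);
        self.count.truncate(len + 1);
    }
}
```

Because all per-prefix state is retained, a snapshot is just a length and rollback is a truncate; that is what makes abort-on-limit and budgeted prompt building cheap in this scheme.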
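For plain one-shot tokenization, the published crates are consumed like any tokenizer library. The following is hypothetical usage: the function and method names (`o200k_base`, `count`) are assumptions based on the crate description above rather than verified signatures, so consult the bpe-openai documentation:

```rust
// Hypothetical usage of the bpe-openai crate; names are assumptions,
// not verified API. Check the crate docs before relying on them.
fn tokens_in(text: &str) -> usize {
    let tokenizer = bpe_openai::o200k_base(); // tokenizer for the o200k_base model
    tokenizer.count(text)
}
```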