In Defense of Tokenizers (huggingface.co)

🤖 AI Summary
The piece argues that tokenization is not a marginal or broken relic of NLP but an unavoidable, understudied design choice that shapes model behavior. It traces why the field moved from whitespace-delimited word tokens (huge vocabularies, out-of-vocabulary words and UNK tokens) to morpheme-based ideas (attractive for agglutinative languages but impractical without language-specific parsers) to subword methods (fixed-size learned vocabularies that avoid OOVs but produce non-intuitive splits). It also explains character- and UTF-8/byte-level approaches, which solve OOVs but dramatically lengthen sequences (for some scripts, e.g. Burmese, UTF-8 can use more than 4× as many bytes as equivalent English text), and surveys so-called "tokenizer-free" work (ByT5, CANINE, Charformer, CharBERT, MegaByte, BLT (the Byte Latent Transformer), H-Net), showing that these still rely on base-unit embeddings (bytes or characters) and downstream chunking, so they are not genuinely free of tokenization. The post highlights that Unicode/UTF-8 is itself a human design with biases, and that tokenization choices implicitly encode political and engineering trade-offs. For the AI/ML community this matters because tokenization affects multilingual coverage, handling of neologisms and misspellings, sequence length and compute, and downstream model generalization. The author calls for more careful study and cross-pollination between static subword research and dynamic, tokenizer-light approaches (empirical comparisons of chunking behavior are largely missing). More researchers engaging with tokenizers, and valuing upstream data and encoding work, would speed progress toward more robust, equitable language models.
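The summary's point about byte-level sequence inflation is easy to verify directly: UTF-8 encodes Burmese script at three bytes per code point versus one for ASCII English, so a byte-level model sees several times more input positions for comparable text. Below is a minimal Python sketch (not from the article; the Burmese greeting and its pairing with "hello" are illustrative assumptions, not a claim of exact translation equivalence):

```python
# Illustrative sketch: compare UTF-8 byte counts to show why byte-level
# models pay a sequence-length penalty on some scripts.
# The Burmese string is a common greeting ("mingalaba"); its pairing with
# "hello" is a rough, hypothetical equivalence for illustration only.

samples = {
    "English": "hello",
    "Burmese": "မင်္ဂလာပါ",
}

for language, text in samples.items():
    code_points = len(text)                 # number of Unicode code points
    byte_count = len(text.encode("utf-8"))  # input units a byte-level model would see
    print(f"{language}: {code_points} code points -> {byte_count} UTF-8 bytes "
          f"({byte_count / code_points:.1f} bytes per code point)")
```

Running this prints roughly 1.0 byte per code point for the English string and 3.0 for the Burmese one, which is the kind of disparity the post attributes to UTF-8's design rather than to any tokenizer.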