Show HN: Skeletoken, a Package for Editing Tokenizers (github.com)

🤖 AI Summary
Skeletoken is a new Python package that makes editing Hugging Face tokenizers safe, structured, and programmatic. Instead of hand-editing complex tokenizer.json files and chasing cryptic parse errors, Skeletoken provides Pydantic-based data models (TokenizerModel) that mirror the constraints of the tokenizers package, so anything you can build with Skeletoken can be parsed by tokenizers. It also surfaces far more actionable validation errors as you progressively modify a tokenizer, which is especially useful for making nontrivial changes reproducibly during research or deployment. Technically, Skeletoken can load tokenizers from the Hub (TokenizerModel.from_pretrained), convert between model representations (to_tokenizer), and apply higher-level transforms: adding pre-tokenizers (e.g., a DigitsPreTokenizer that splits digits), decasing the vocabulary (making tokens lowercase), or converting a model to a greedy tokenizer. The project already includes automated lowercasing and helpers for adding modules; planned work includes vocabulary edits and cross-checks (e.g., ensuring merges and AddedTokens remain consistent). It's pip-installable (pip install skeletoken), MIT-licensed, and aimed at making tokenizer surgery safer, more auditable, and easier to integrate into ML pipelines.
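
A minimal sketch of what that workflow might look like, based only on the names mentioned in the summary (TokenizerModel.from_pretrained, to_tokenizer, DigitsPreTokenizer); the exact signatures and the helper used to attach a pre-tokenizer are assumptions, not verified against the Skeletoken source.

```python
# Sketch under the assumptions stated above; not a verified Skeletoken example.
from skeletoken import TokenizerModel

# Load an existing tokenizer from the Hugging Face Hub into a Pydantic data model
# instead of editing tokenizer.json by hand.
model = TokenizerModel.from_pretrained("bert-base-uncased")

# Hypothetical edit step: attach a pre-tokenizer that splits digits into
# separate tokens (the summary mentions a DigitsPreTokenizer and "helpers for
# adding modules", but the actual helper name is not confirmed here).
# model.add_pre_tokenizer(DigitsPreTokenizer())

# Convert the edited data model back into a tokenizers.Tokenizer; because the
# data model mirrors the tokenizers constraints, the result should parse cleanly.
tokenizer = model.to_tokenizer()
tokenizer.save("tokenizer.json")
```

The appeal of this shape is that validation happens on the structured model while you edit, rather than surfacing as a parse error only when the finished tokenizer.json is loaded.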