Show HN: Chonky – neural text semantic chunking goes multilingual (huggingface.co)

🤖 AI Summary
Chonky is a multilingual transformer model, with an accompanying Python library, that segments long text into semantically coherent chunks for retrieval-augmented generation (RAG) pipelines. Published as mirth/chonky_mmbert_small_multilingual_1, it can be used through a simple ParagraphSplitter wrapper or as a token-classification (NER-style) pipeline that emits "separator" tokens.

The model was fine-tuned on sequences of length 1024 using MiniPile, BookCorpus, and Project Gutenberg data, and trained on a single H100 for several hours, making it lightweight and easy to integrate into embedding-and-retrieval workflows. Reported token-based F1 validation scores are strong across many languages (e.g., de 0.88, es 0.91, fr 0.93, ru 0.97, en 0.78) but notably weak on Chinese (zh 0.11). Comparisons with earlier Chonky variants (ModernBERT/DistilBERT bases) show the multilingual mmBERT small model substantially outperforming them on standard corpora.

Practical caveats: the model was fine-tuned at 1024 tokens (though mmBERT can support longer contexts), and quality will vary by domain and language. For practitioners building multilingual RAG systems, Chonky offers an off-the-shelf, efficient chunker that improves semantic chunk boundaries for embeddings and downstream LLM prompting, but you should validate performance on your target languages and domains.
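To make the workflow concrete, here is a minimal sketch. The `ParagraphSplitter` call mirrors the usage shown on the model card (it downloads the model on first use, so it is left as a comment); the `chunks_from_separators` helper below is a hypothetical illustration of the underlying idea, i.e., how separator positions predicted by an NER-style pipeline could be mapped back onto chunk boundaries:

```python
# Illustrative usage of Chonky's ParagraphSplitter (names per the model card;
# commented out because it triggers a model download):
#
#   from chonky import ParagraphSplitter
#   splitter = ParagraphSplitter(
#       model_id="mirth/chonky_mmbert_small_multilingual_1", device="cpu"
#   )
#   for chunk in splitter(long_text):
#       print(chunk)
#
# Hypothetical helper: turn predicted separator character offsets
# (end positions, exclusive) into text chunks.

def chunks_from_separators(text: str, separator_ends: list[int]) -> list[str]:
    """Split `text` at each end offset, trimming whitespace and dropping empties."""
    chunks, start = [], 0
    for end in sorted(separator_ends):
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks

text = "First topic sentence. Second topic sentence."
print(chunks_from_separators(text, [21]))
# → ['First topic sentence.', 'Second topic sentence.']
```

The resulting chunks would then be embedded individually and indexed for retrieval, which is where better semantic boundaries pay off.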