🤖 AI Summary
Rag-chunk is a new CLI tool (pip install rag-chunk) for preparing Markdown corpora for Retrieval-Augmented Generation (RAG). It parses and cleans Markdown, generates chunks using three built-in strategies (fixed-size, sliding-window, paragraph), and offers recall-based evaluation against a JSON test file of questions and relevant phrases. Results can be output as a table, JSON, or CSV, and generated chunks are written to a temporary .chunks directory. The project ships with an example corpus and sensible CLI defaults (chunk size 200 words, overlap 50 words, top-k 3), and it lets you run all strategies in one pass to compare results.
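To make the three strategies concrete, here is a minimal, illustrative sketch of how fixed-size, sliding-window, and paragraph chunking typically work over whitespace-split words. The function names, signatures, and details are assumptions for illustration, not rag-chunk's actual implementation.

```python
# Illustrative sketch of the three chunking strategies described above.
# Not rag-chunk's actual code: names, signatures, and details are assumptions.

def fixed_size_chunks(text: str, chunk_size: int = 200) -> list[str]:
    """Consecutive, non-overlapping chunks of `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def sliding_window_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Like fixed-size, but adjacent chunks share `overlap` words of context."""
    words = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

def paragraph_chunks(text: str) -> list[str]:
    """One chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```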
For practitioners, rag-chunk offers a fast way to measure how chunking choices affect retrieval recall: it retrieves the top-k chunks by lexical similarity and computes per-question recall = found_relevant_phrases / total_relevant_phrases, with rough benchmarks (above 0.85 excellent, 0.70–0.85 good, and so on). Paragraph chunking often preserves semantic boundaries in well-structured docs, sliding windows help retain context across chunk boundaries, and fixed-size chunking is a consistent baseline. The tool is extensible (add custom chunkers in src/chunker.py and register them in STRATEGIES) and currently counts words by simple whitespace splitting; optional tiktoken-based token-level chunking is planned to better match LLM context limits. This makes rag-chunk a practical, reproducible utility for tuning chunk size and overlap and for reducing hallucination risk in RAG pipelines.
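As a sketch of that recall metric, the evaluation loop might look roughly like the following. It assumes a test file whose entries carry "question" and "relevant_phrases" fields (an assumption about the format) and uses a crude word-overlap score for lexical similarity; it is illustrative only, not rag-chunk's actual code.

```python
# Illustrative recall evaluation: retrieve top-k chunks by lexical similarity,
# then score recall = found_relevant_phrases / total_relevant_phrases.
# The JSON field names ("question", "relevant_phrases") are assumptions.
import json

def lexical_score(question: str, chunk: str) -> float:
    """Crude lexical similarity: share of question words present in the chunk."""
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def mean_recall(chunks: list[str], test_file: str, top_k: int = 3) -> float:
    """Average per-question recall over the top-k retrieved chunks."""
    with open(test_file) as f:
        tests = json.load(f)
    recalls = []
    for t in tests:
        ranked = sorted(chunks, key=lambda ch: lexical_score(t["question"], ch), reverse=True)
        retrieved = " ".join(ranked[:top_k]).lower()
        found = sum(1 for phrase in t["relevant_phrases"] if phrase.lower() in retrieved)
        recalls.append(found / max(len(t["relevant_phrases"]), 1))
    return sum(recalls) / max(len(recalls), 1)
```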