Show HN: Contract Extraction Assistant – Local, open-source contract data tool (github.com)

🤖 AI Summary
Contract Extraction Assistant is an open-source tool that provides a local, self-hosted pipeline for extracting key contract terms (start/end dates, renewal clauses, termination notice) and exporting them as JSON, CSV, or PDF. It emphasizes data sovereignty via a bring-your-own-key (BYOK) model (the default demo uses a shared Mistral key), parallel batch processing, and a hybrid LLM+regex approach that aims for faster, more consistent extractions than single-file, chat-style interactions. The repo is MIT-licensed, Docker-friendly, and intended for anyone who needs scalable, auditable contract pipelines without sending full PDFs to third-party platforms.

Technically, the stack uses a Flask API with PyMuPDF for parsing, the Mistral SDK for inference, spaCy for NLP, and a React+Vite frontend. Extraction patterns are defined in YAML, and a regex fallback augments LLM inference. Reported performance: a single 12-page contract takes ≈3s and a batch of 4 contracts (89 pages) takes ≈10s, with parallel processing giving ~9–11× speedups over sequential chat workflows. Outputs include source attribution, page snippets, and timestamps.

Caveats: demo mode may log data and uses a shared key, so don’t upload sensitive documents. Snippet extraction reliability is ~50%, and broader accuracy testing (500 contracts) is still pending. For ML practitioners, this is a practical reference implementation of windowed context prompts, hybrid heuristics, and self-hosted inference that can be extended to more fields, providers, languages, and audit features.
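To make the hybrid LLM+regex idea and the parallel batch step more concrete, here is a minimal Python sketch. It is not taken from the repo: the field names, the regex patterns, and the functions `fake_llm`, `extract_fields`, and `extract_batch` are all hypothetical, and the LLM call is replaced by a stub so the example runs without an API key. The project itself reportedly keeps its patterns in YAML and calls Mistral for inference.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional

# Hypothetical fallback patterns (assumption: the real project loads its own from YAML).
FALLBACK_PATTERNS = {
    "start_date": re.compile(
        r"effective\s+(?:as\s+of|on)\s+(\d{4}-\d{2}-\d{2})", re.IGNORECASE
    ),
    "termination_notice": re.compile(
        r"(\d+)\s+days['’]?\s+(?:prior\s+)?(?:written\s+)?notice", re.IGNORECASE
    ),
}

SAMPLE = (
    "This Agreement is effective as of 2024-01-15. "
    "Either party may terminate with 30 days written notice."
)


def fake_llm(text: str) -> Dict[str, Optional[str]]:
    """Stand-in for a real LLM call: it only ever 'finds' the start date."""
    return {"start_date": "2024-01-15"} if "2024-01-15" in text else {}


def extract_fields(
    text: str, llm_extract: Callable[[str], Dict[str, Optional[str]]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Prefer the LLM's value for each field; fall back to a regex when it is missing."""
    llm_fields = llm_extract(text)
    result = {}
    for field, pattern in FALLBACK_PATTERNS.items():
        value = llm_fields.get(field)
        source = "llm" if value else None
        if not value:
            match = pattern.search(text)
            if match:
                value, source = match.group(1), "regex"
        # Record where each value came from, mirroring the idea of source attribution.
        result[field] = {"value": value, "source": source}
    return result


def extract_batch(
    docs: Dict[str, str],
    llm_extract: Callable[[str], Dict[str, Optional[str]]],
    max_workers: int = 4,
) -> Dict[str, Dict[str, Dict[str, Optional[str]]]]:
    """Process documents concurrently; threads suit I/O-bound API calls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(extract_fields, text, llm_extract)
            for name, text in docs.items()
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(extract_batch({"a.pdf": SAMPLE, "b.pdf": "No dates here."}, fake_llm))
```

In this sketch the stub returns only the start date, so the termination notice for `a.pdf` is filled in by the regex fallback and tagged with `source: "regex"`. A thread pool is used on the assumption that the real bottleneck is waiting on network calls to the inference provider rather than local CPU work.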