LLM PDF OCR Markdown Book – Turn Scanned PDFs into ePub/Kindle with LLM (github.com)

0 points | 1 day ago

🤖 AI Summary

ocr_md_book.py is a compact Python tool that turns a directory of scanned page images into clean, merged Markdown and packages the result as an EPUB, with optional AZW3/MOBI output via Calibre. It uses Alibaba DashScope (Tongyi) multimodal models for OCR, applies light post-processing (removing headers, footers, and page numbers, and fixing hard wraps), and writes per-page Markdown files plus a single merged book.md and book.epub. The script is resumable, handles EXIF auto-rotation and optional downscaling, supports natural image ordering or a custom --from-list, and can process PDFs by first converting them to PNGs with pdftoppm.

For AI/ML practitioners this is a practical example of integrating cloud multimodal LLM OCR into a reproducible pipeline: concurrent async inference, model selection (e.g. qwen3-omni-flash), payload handling (base64 vs. public URL), and simple text-cleaning heuristics that improve the resulting Markdown and EPUB.

Requirements: Python 3.10+, httpx, pillow, tqdm, and pyyaml, a DashScope API key, pandoc (required), and Calibre (optional). Operational notes: use --skip-ocr-existing to resume, watch for HTTP 400 errors if the chosen model needs public URLs, and ensure the cover path exists for pandoc. The project has no explicit license, so follow the licenses of its components and the DashScope terms. It is useful for researchers and engineers who need reliable digitization and EPUB production with LLM-powered OCR.
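To make the natural ordering and text-cleaning steps more concrete, here is a minimal sketch of how such heuristics might look. It is not taken from ocr_md_book.py: the function names (`natural_key`, `drop_page_numbers`, `unwrap_hard_wraps`, `clean_page`) and the page-number pattern are assumptions made for illustration, and the real script may use different rules.

```python
import re

# Hypothetical helpers; names and rules are illustrative, not the project's own.

def natural_key(name: str):
    """Sort key that compares digit runs numerically (page_2 before page_10)."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

# Assumed pattern: a line holding only a short number, optionally prefixed by "page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)

def drop_page_numbers(lines):
    """Remove lines that look like bare page numbers."""
    return [line for line in lines if not PAGE_NUMBER.match(line)]

def unwrap_hard_wraps(lines):
    """Join consecutive non-blank lines into paragraphs.

    Blank lines separate paragraphs; headings, list items, and block quotes
    are kept on their own line.
    """
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if stripped.startswith(("#", "- ", "* ", ">")):
            if current:
                paragraphs.append(" ".join(current))
                current = []
            paragraphs.append(stripped)
            continue
        current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)

def clean_page(text: str) -> str:
    """Apply both heuristics to the OCR output of one page."""
    return unwrap_hard_wraps(drop_page_numbers(text.splitlines()))

if __name__ == "__main__":
    pages = ["page_10.png", "page_2.png", "page_1.png"]
    print(sorted(pages, key=natural_key))
    print(clean_page("The quick brown\nfox jumps.\n\n12\n\nNext paragraph\ncontinues here.\n"))
```

A regex this simple will also drop a legitimate line that consists only of a number, so a real pipeline would likely restrict the check to the first and last few lines of each page.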
