🤖 AI Summary
IBM Research released Granite Docling 258M, a compact multimodal Image-Text-to-Text model designed for efficient document conversion and full compatibility with the Docling ecosystem (Apache 2.0, released Sept 17, 2025). Built on the IDEFICS3 blueprint, it swaps in a siglip2-base-patch16-512 vision encoder and a Granite 165M LLM, connected via a pixel-shuffle projector and trained with the nanoVLM framework on IBM's Blue Vela cluster (NVIDIA H100s). The model consolidates multiple single-purpose converters into one VLM and is accessible through the Docling CLI/SDK or through runtimes such as transformers, vLLM, and MLX.
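As a rough sketch of that access path, a Docling CLI conversion run might look like the following. The `--pipeline vlm` and `--vlm-model granite_docling` flags follow the usage shown in IBM's release materials, but exact flag names can vary between Docling versions, so treat this as illustrative rather than definitive.

```shell
# Install Docling (model weights are fetched on first use)
pip install docling

# Convert a PDF using the VLM pipeline backed by Granite Docling.
# Output lands in the current directory as Markdown by default.
docling --pipeline vlm --vlm-model granite_docling mydoc.pdf
```

The same conversion is reachable programmatically through Docling's Python SDK (`DocumentConverter`), and the model itself can be loaded directly with transformers, vLLM, or MLX for custom inference.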
Granite Docling emphasizes document-structure tasks: improved equation recognition and inline math, robust code recognition, near-perfect table-structure reconstruction (TEDS 0.97), better OCR/layout F1 scores, and document-element QA. Key gains stem from curated synthetic corpora (SynthCodeNet, SynthFormulaNet, SynthChartNet) plus DoclingMatix real pages and supervised fine-tuning on DocTags to speed convergence. It supports flexible inference (full-page or bbox-guided), fast batch inference with vLLM, and experimental support for Japanese, Arabic, and Chinese. For the AI/ML community this demonstrates that small, task-focused VLMs, when paired with targeted synthetic data and efficient training, can rival larger systems on document understanding, enabling lower-cost, faster pipelines for math-, code-, and table-heavy documents while remaining integrated and reproducible via Docling's tools and eval suites.