GLM-OCR – A multimodal OCR model for complex document understanding (github.com)

🤖 AI Summary
GLM-OCR is a multimodal OCR model for complex document understanding. Built on the GLM-V encoder-decoder architecture, it combines a CogViT visual encoder, a lightweight cross-modal connector, and a GLM-0.5B language decoder in a two-stage pipeline that first analyzes page layout and then recognizes the detected regions in parallel. Training introduces a Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning, which improve training efficiency and recognition accuracy across diverse document layouts.

The model scores 94.62 on the OmniDocBench V1.5 benchmark, ranking first on document understanding tasks including formula and table recognition. With only 0.9B parameters it keeps inference latency low, is tuned for real-world business documents, and can be deployed through platforms such as vLLM and Ollama. The model is fully open-sourced and ships with an SDK for integration into document processing pipelines.
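Since the summary mentions deployment via vLLM, a request against a vLLM-served instance might look like the minimal sketch below, which uses vLLM's OpenAI-compatible chat API. The model name `glm-ocr`, the localhost endpoint, the input filename, and the prompt wording are illustrative assumptions, not taken from the repository.

```python
# A minimal sketch of querying GLM-OCR through vLLM's OpenAI-compatible
# server. The model name "glm-ocr", the localhost URL, the filename, and
# the prompt are assumptions for illustration only.
import base64

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; adjust base_url and model to
# match how the server was launched (e.g. `vllm serve <model-path>`).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical input: a scanned document page, sent as a base64 data URL.
with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glm-ocr",  # assumed served model name
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {"type": "text", "text": "Extract all text from this document."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```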