Text or pixels? On the token efficiency of visual text inputs in multimodal LLMs (arxiv.org)

🤖 AI Summary
Researchers show that rendering long text as an image and feeding that image into multimodal decoder LLMs is a practical way to compress inputs: by supplying “text-as-image” instead of many text tokens, models can often cut the number of decoder tokens roughly in half without hurting task performance. The paper evaluates this approach on two benchmarks — RULER (long-context retrieval) and CNN/DailyMail (document summarization) — and finds substantial token savings while maintaining retrieval and summarization quality. Technically, the trick leverages visual inputs to bypass subword tokenization: long passages are rasterized into a single image and consumed through the model’s visual input pathway, reducing the load on the decoder’s token stream. That yields lower token billing and mitigates decoder context limits, making it appealing for long-document workflows. Practical considerations remain — e.g., added visual-encoder compute, image resolution/OCR fidelity, and architecture compatibility — but the method promises a simple, broadly applicable way to extend effective context length and cut inference cost for multimodal LLM deployments.
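The core move is simple to prototype. Below is a minimal sketch (not the paper's code) of the "text-as-image" idea: rasterize a long passage into a single image so it can be consumed through a multimodal model's visual pathway, leaving only the short instruction as text tokens. The rendering parameters (font, wrap width, line height) are illustrative assumptions, not values from the paper.

```python
# Minimal illustrative sketch: render a long passage as an image so a
# multimodal LLM can read it through its visual encoder instead of as
# subword tokens. Font, wrap width, and line height are assumptions.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text: str, chars_per_line: int = 80,
                         line_height: int = 14, margin: int = 10) -> Image.Image:
    """Wrap the passage and draw it onto a white canvas."""
    lines = textwrap.wrap(text, width=chars_per_line)
    font = ImageFont.load_default()
    width = margin * 2 + chars_per_line * 7      # rough per-character width estimate
    height = margin * 2 + line_height * max(len(lines), 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return img

# Usage: the saved image (not the raw passage) is what gets handed to the
# model's image processor; only the prompt/question stays as text tokens.
passage = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 200
render_text_as_image(passage).save("passage.png")
```

Whether the savings materialize then depends on how many visual tokens the model's image processor produces for an image of that resolution, which is exactly the architecture-dependent trade-off the paper measures.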