🤖 AI Summary
An experimental adapter-based “optical compression” pipeline was released that lets Qwen3-VL-2B-Instruct process very long documents by rendering pages to images and feeding them through DeepSeek’s DeepEncoder (kept frozen) plus a small trainable adapter that maps the vision outputs into the VLM’s embedding space. The approach trades per-document latency for the ability to complete far longer inputs without OOM or context-window failures: optical processing completes 90% of long-document samples versus 22% for native text, and achieves a higher overall score (18% vs 12%) because native text processing frequently exhausts the context window despite its higher per-completion accuracy.
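A minimal sketch of that data flow, using stand-in modules; the stub encoder, the per-page token count, and the single-linear stub adapter are placeholders for illustration, not the released code:

```python
import torch
import torch.nn as nn

# Stand-ins for the real components: the frozen DeepEncoder produces
# 1280-dim vision tokens from 1024x1024 page renders, and the trainable
# adapter maps them into the VLM's 2048-dim embedding space.

class StubDeepEncoder(nn.Module):
    """Placeholder for the frozen DeepEncoder (SAM-ViT-B + CLIP-L + projector)."""
    def forward(self, pages: torch.Tensor) -> torch.Tensor:
        # pages: (num_pages, 3, 1024, 1024); 256 tokens per page is an assumption
        return torch.randn(pages.shape[0], 256, 1280)

class StubAdapter(nn.Module):
    """Placeholder for the trainable adapter (detailed in the next sketch)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1280, 2048)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)

encoder, adapter = StubDeepEncoder().eval(), StubAdapter()
pages = torch.randn(4, 3, 1024, 1024)      # four rendered document pages
with torch.no_grad():
    vision_tokens = encoder(pages)         # frozen: no gradients flow here
soft_tokens = adapter(vision_tokens)       # (4, 256, 2048)
# `soft_tokens` would stand in for the far more numerous text-token
# embeddings Qwen3-VL-2B would otherwise need for the same pages.
```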
Key technical points: DeepEncoder (401M params: SAM-ViT-B + CLIP-L + a projector) produces 1280-dim vectors from 1024×1024 page renders; a 10.6M-parameter adapter (MLP 1280→3072→2048, layer norm, and learnable embeddings for up to 200 pages) is trained with an MSE loss to align those outputs to Qwen3-VL-2B’s 2048-dim embedding space. Training used 1k synthetic Wikipedia documents (5k–100k chars), took ~2–3 hours on an RTX 5070 (12GB), reduced the loss by 87%, and yielded ~2.2× token compression (average tokens per document dropping from 38K to 17K). Trade-offs: optical processing is ~4× slower (24s vs 6s) and less accurate on the samples it does complete (20% vs 54.5%), but it enables processing of 100+ page documents; the overall scores above are consistent with completion rate × per-completion accuracy (0.90 × 20% ≈ 18%, 0.22 × 54.5% ≈ 12%). This is experimental research code (pretrained adapter on HuggingFace: Volkopat/Qwen-VLM-Optical-Encoder) and hasn’t been validated at larger scales or on production workloads.
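A hedged sketch of an adapter with those dimensions in PyTorch. The GELU activation, the placement of the layer norm, and the reading of “200-page embeddings” as learnable per-page position embeddings are assumptions; with those choices the layout happens to come out near the reported 10.6M parameters, but the real breakdown may differ:

```python
import torch
import torch.nn as nn

class OpticalAdapter(nn.Module):
    """Sketch of the trainable adapter: LayerNorm on the 1280-dim DeepEncoder
    output, MLP 1280 -> 3072 -> 2048, plus learnable embeddings for up to
    200 pages. Activation and page-embedding placement are assumptions."""

    def __init__(self, vision_dim=1280, hidden_dim=3072, llm_dim=2048, max_pages=200):
        super().__init__()
        self.norm = nn.LayerNorm(vision_dim)
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )
        self.page_embed = nn.Embedding(max_pages, llm_dim)

    def forward(self, vision_tokens, page_ids):
        # vision_tokens: (batch, seq, 1280) from the frozen DeepEncoder
        # page_ids:      (batch, seq) index of the source page for each token
        x = self.mlp(self.norm(vision_tokens))
        return x + self.page_embed(page_ids)

# MSE alignment step: match adapter outputs to Qwen3-VL-2B's 2048-dim
# embeddings for the same content (placeholder tensors stand in for both).
adapter = OpticalAdapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

vision_tokens = torch.randn(2, 256, 1280)          # fake encoder outputs
page_ids = torch.zeros(2, 256, dtype=torch.long)   # fake page indices
target_embeds = torch.randn(2, 256, 2048)          # fake target text embeddings

optimizer.zero_grad()
loss = loss_fn(adapter(vision_tokens, page_ids), target_embeds)
loss.backward()
optimizer.step()
```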