Adaptive Batching – 5x throughput on your data pipelines (cocoindex.io)

🤖 AI Summary
CocoIndex announced built-in batching for functions such as EmbedText and SentenceTransformerEmbed (plus the ColPali image and query embedders), delivering roughly 5× throughput, i.e. about an 80% reduction in runtime, when embedding the CocoIndex codebase with sentence-transformers/all-MiniLM-L6-v2. In benchmarks on an M1 Pro laptop, a small text example dropped from 1.96s to 0.63s (~68% saving) and a large code-embedding job fell from 58.93s to 12.52s (~79% saving). Enabling batching for custom ops is trivial (set batching=True and use list inputs/outputs), and existing code keeps working unchanged.

Technically, CocoIndex uses an adaptive, knob-free batching policy: while a batch executes on-device, new requests queue; when it finishes, all queued requests form the next batch. This avoids timers and target sizes and automatically trades latency for throughput depending on load (tiny batches when traffic is sparse, large batches when it is busy).

Per-function packing adds further efficiency: SentenceTransformerEmbed, for example, splits work into micro-batches (default 32) to fit device memory and sorts inputs by token length to reduce padding overhead; larger micro-batches amortize fixed costs better but yield diminishing returns. Gains are biggest when fixed overhead dominates (small models, many short inputs); large models, or backends that don't support internal batching (e.g., Ollama in the experiments), show much smaller improvements.
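To make the "set batching=True and use list inputs/outputs" point concrete, here is a minimal sketch of what a batching-enabled custom op could look like. The decorator path (cocoindex.op.function) and the exact flag name are assumptions taken from the summary rather than verified against the CocoIndex API, and the embedding body is a placeholder.

```python
# Hypothetical sketch of a batching-enabled custom op. The decorator path and
# the batching=True flag follow the summary's description and are assumptions.
import cocoindex


@cocoindex.op.function(batching=True)
def embed_texts(texts: list[str]) -> list[list[float]]:
    """With batching enabled, the op receives all queued inputs as one list
    and must return exactly one output per input, in the same order."""
    # Placeholder embedding; in practice this would call a real model
    # (e.g. a sentence-transformers encoder) once for the whole batch.
    return [[float(len(t))] for t in texts]
```

Per the summary, callers don't change: existing single-value call sites keep working, and the runtime handles packing requests into the list argument.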
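The adaptive policy itself is easy to picture in code. Below is a small, self-contained asyncio sketch (not CocoIndex's implementation) of the behavior described: while one batch runs, new requests queue; when it completes, everything queued becomes the next batch, with no timers or target sizes.

```python
import asyncio
from typing import Callable, Sequence


class AdaptiveBatcher:
    """Toy model of the knob-free policy: one batch in flight at a time;
    whatever queued while it ran becomes the next batch."""

    def __init__(self, run_batch: Callable[[Sequence[str]], Sequence[object]]):
        self._run_batch = run_batch          # blocking batch function, e.g. model.encode
        self._pending: list[tuple[str, asyncio.Future]] = []
        self._wakeup = asyncio.Event()
        self._worker: asyncio.Task | None = None

    async def submit(self, item: str):
        if self._worker is None:
            self._worker = asyncio.create_task(self._loop())
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((item, fut))
        self._wakeup.set()
        return await fut

    async def _loop(self):
        while True:
            await self._wakeup.wait()
            self._wakeup.clear()
            batch, self._pending = self._pending, []   # drain everything queued so far
            if not batch:
                continue
            inputs = [item for item, _ in batch]
            # Run the blocking batch call off the event loop; requests arriving
            # meanwhile accumulate in self._pending for the next batch.
            outputs = await asyncio.to_thread(self._run_batch, inputs)
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def main() -> None:
    batcher = AdaptiveBatcher(lambda xs: [x.upper() for x in xs])  # stand-in for a model call
    print(await asyncio.gather(*(batcher.submit(t) for t in ["a", "b", "c"])))


if __name__ == "__main__":
    asyncio.run(main())
```

Under light load a "batch" is often a single item (lowest latency); under heavy load the queue grows while the device is busy, so batches grow with it, which is the latency-for-throughput trade described above.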
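The per-function packing step can be sketched the same way: split a large batch into micro-batches (32 here, matching the summary's stated default) and sort by length so items padded together are of similar size. The helper names and the use of character length as a stand-in for token length are illustrative assumptions.

```python
from typing import Callable, Iterable, List, Sequence


def length_sorted_micro_batches(texts: Sequence[str], size: int = 32) -> Iterable[List[int]]:
    """Yield index groups: inputs sorted by (approximate) length, then chunked."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))  # proxy for token count
    for start in range(0, len(order), size):
        yield order[start:start + size]


def embed_packed(
    texts: Sequence[str],
    encode: Callable[[List[str]], List[List[float]]],
    size: int = 32,
) -> List[List[float]]:
    """Encode micro-batches of similar-length inputs, then restore input order."""
    out: List[List[float]] = [[] for _ in texts]
    for group in length_sorted_micro_batches(texts, size):
        vectors = encode([texts[i] for i in group])    # one device call per micro-batch
        for i, vec in zip(group, vectors):
            out[i] = vec
    return out
```

Sorting keeps short and long inputs out of the same micro-batch, so less of each device call is spent on padding tokens, while the micro-batch size caps memory use and still amortizes per-call overhead.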