🤖 AI Summary
LLM routing research tackles the practical problem that no single model is best for all queries: some models are cheap and handle simple prompts, others are costly but necessary for hard tasks. One approach clusters a labeled query set in embedding space, measures each model's accuracy and cost per cluster, and scores models with a normalized performance–efficiency score x = α·perf + (1−α)·(1−cost), picking the best model within a query's nearest cluster; α trades off quality against spend. Arch-Router instead frames routing as preference alignment: a compact 1.5B generative router reads a natural-language set of route policies in-prompt and emits a policy id (the function F), while a separate table (T) maps policies to models. This decoupling means adding a new model only requires editing T, not retraining the router; Arch-Router runs in tens of milliseconds and reports ~28× lower latency than a commercial baseline.
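To make the cluster-based score concrete, here is a minimal sketch, assuming per-cluster accuracy and normalized cost have already been measured offline on the labeled query set. The names (`CLUSTER_STATS`, `CENTROIDS`, `route_query`) and the toy numbers are illustrative, not taken from any paper:

```python
import numpy as np

# Hypothetical per-cluster stats: (accuracy, normalized cost in [0, 1])
# per candidate model, measured offline on labeled queries.
CLUSTER_STATS = {
    0: {"strong-model": (0.92, 0.80), "cheap-model": (0.88, 0.10)},
    1: {"strong-model": (0.95, 0.80), "cheap-model": (0.61, 0.10)},
}
CENTROIDS = np.array([[0.1, 0.9], [0.8, 0.2]])  # toy 2-D cluster centroids

def route_query(query_emb: np.ndarray, alpha: float = 0.5) -> str:
    """Pick the model maximizing x = alpha*perf + (1-alpha)*(1-cost)
    within the query's nearest cluster; alpha trades quality vs. spend."""
    nearest = int(np.argmin(np.linalg.norm(CENTROIDS - query_emb, axis=1)))
    scores = {
        model: alpha * perf + (1 - alpha) * (1 - cost)
        for model, (perf, cost) in CLUSTER_STATS[nearest].items()
    }
    return max(scores, key=scores.get)

print(route_query(np.array([0.75, 0.25]), alpha=0.3))  # -> cheap-model
print(route_query(np.array([0.75, 0.25]), alpha=0.9))  # -> strong-model
```

Sweeping α from 0 to 1 traces out the quality-vs-spend frontier: the same query flips from the cheap model to the strong one as quality is weighted more heavily.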
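The Arch-Router decoupling can be sketched in the same hedged spirit: the router implements F (query → policy id) and a plain table implements T (policy id → model), so swapping models touches only the table. The `call_router` stub below stands in for prompting the 1.5B router; all names and policies are placeholders:

```python
# Natural-language route policies the router reads in-prompt (the domain of F).
ROUTE_POLICIES = {
    "code_generation": "Requests to write or modify source code.",
    "casual_chat": "Small talk and open-ended conversation.",
}

# The separate mapping table T: adding or swapping a model edits only this dict.
POLICY_TO_MODEL = {
    "code_generation": "big-coder-model",
    "casual_chat": "cheap-chat-model",
}

def call_router(query: str, policies: dict[str, str]) -> str:
    # Stand-in for the generative router, which would be prompted with the
    # policy descriptions and would emit a policy id.
    return "code_generation" if "code" in query.lower() else "casual_chat"

def route(query: str) -> str:
    policy_id = call_router(query, ROUTE_POLICIES)  # F: query -> policy id
    return POLICY_TO_MODEL[policy_id]               # T: policy id -> model

print(route("Write code to parse a CSV file"))  # -> big-coder-model
```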
Other lines of work add online learning and budget constraints. PILOT pretrains a shared query→model embedding space from human preferences, then runs a LinUCB-style bandit that updates model embeddings online and enforces a running cost budget via an online knapsack-like constraint. RouteLLM focuses on the binary choice between a strong and a weak model, training on preference labels with classifiers or lightweight Bradley–Terry and matrix-factorization routers; performance is summarized by APGR (average performance gap recovered), the area under the normalized-quality vs. strong-call-rate curve. Across papers, routers cut expensive calls substantially (reported ~2.5–3.7× cost savings at high quality) with negligible serving overhead, but remain limited by the quality of policies and labels and by the model-to-policy mapping.
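A generic LinUCB sketch with a hard budget filter conveys the flavor of PILOT's bandit, though not its exact online-knapsack formulation or its preference-pretrained embeddings; class and parameter names here are illustrative:

```python
import numpy as np

class BudgetedLinUCB:
    """LinUCB over candidate models with a hard running-budget filter."""

    def __init__(self, costs: dict[str, float], dim: int,
                 alpha: float = 1.0, budget: float = 100.0):
        self.costs = costs                          # model -> per-call cost
        self.alpha = alpha                          # exploration strength
        self.budget = budget                        # remaining spend
        self.A = {m: np.eye(dim) for m in costs}    # per-model design matrices
        self.b = {m: np.zeros(dim) for m in costs}  # per-model reward vectors

    def select(self, x: np.ndarray) -> str:
        # Knapsack-like filter: only consider models we can still afford.
        affordable = [m for m, c in self.costs.items() if c <= self.budget]
        def ucb(m: str) -> float:
            A_inv = np.linalg.inv(self.A[m])
            theta = A_inv @ self.b[m]               # ridge-regression estimate
            return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return max(affordable, key=ucb)

    def update(self, model: str, x: np.ndarray, reward: float) -> None:
        self.A[model] += np.outer(x, x)             # rank-one design update
        self.b[model] += reward * x
        self.budget -= self.costs[model]            # charge the call

bandit = BudgetedLinUCB({"strong-model": 1.0, "cheap-model": 0.1}, dim=2)
x = np.array([0.3, 0.7])            # query embedding
m = bandit.select(x)
bandit.update(m, x, reward=1.0)     # e.g. the user preferred this answer
```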
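And a small sketch of an APGR-style summary as described above: normalize router quality so the weak model scores 0 and the strong model 1, then take the area under that curve against the strong-call rate. The toy numbers are invented:

```python
def apgr(strong_call_rates, router_quality, weak_q, strong_q):
    """Area under normalized quality vs. strong-call rate (trapezoidal rule).

    Quality is normalized so the weak model scores 0 and the strong model 1,
    i.e. the performance gap recovered at each strong-call rate.
    """
    pgr = [(q - weak_q) / (strong_q - weak_q) for q in router_quality]
    return sum(
        (pgr[i] + pgr[i + 1]) / 2 * (strong_call_rates[i + 1] - strong_call_rates[i])
        for i in range(len(pgr) - 1)
    )

# Toy curve: router quality as 0%..100% of queries go to the strong model.
rates = [0.0, 0.25, 0.5, 0.75, 1.0]
quality = [0.60, 0.74, 0.80, 0.83, 0.85]   # weak_q = 0.60, strong_q = 0.85
print(round(apgr(rates, quality, 0.60, 0.85), 3))  # -> 0.695
```

A router that recovers most of the strong model's quality while making few strong calls bends this curve upward, pushing APGR toward 1.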