David vs. Goliath: are small LLMs any good? (fin.ai)

🤖 AI Summary
Intercom experimented with replacing single large-LLM calls in their Fin RAG pipeline with two smaller, task-specialized models: a ModernBERT-based binary "issue detector" and a generatively fine-tuned "issue extractor" built with LoRA adapters. The detector was trained on ~1M examples and reached 0.995 AUC; the extractor was fine-tuned with parameter-efficient LoRA on open models (Gemma 8b, Qwen3 8b, and Qwen3 14b) using 60k training and 10k validation samples. Offline, Qwen3 14b matched the production baseline's answer rate with high semantic alignment (≈0.93–0.94). One important data lesson: a variant trained only on "hard" resolutions became overly conservative and often produced no output at all, underscoring how data curation shapes model behavior.

In production A/B tests the split models were near parity with the large-model baseline while lowering cost and keeping the customer experience stable: the detector shifted answer rate by −0.5pp with −100 ms P50 latency and −5% cost; the extractor shifted answer rate by −0.1pp with +100 ms P50 latency and −12.5% cost, with no CSAT impact.

Key takeaways for the AI/ML community: for narrow, well-scoped tasks, smaller fine-tuned models (ModernBERT for classification, LoRA for efficient generative tuning) can deliver production-grade accuracy and substantial cost savings, but success depends heavily on careful task decomposition and high-quality, representative training data.
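The post doesn't publish code, but the detector-then-extractor split maps naturally onto Hugging Face `transformers` plus `peft`. Below is a minimal sketch under that assumption; the checkpoint names, LoRA hyperparameters, and prompt format are illustrative guesses, not details from the article.

```python
# Sketch of the two-model split: a small encoder classifier gates a LoRA-adapted generator.
# Checkpoint names, LoRA hyperparameters, and the prompt are assumptions, not from the post.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from peft import LoraConfig, TaskType, get_peft_model

# 1) Binary "issue detector": a ModernBERT sequence classifier (assumed base checkpoint).
det_name = "answerdotai/ModernBERT-base"
det_tok = AutoTokenizer.from_pretrained(det_name)
detector = AutoModelForSequenceClassification.from_pretrained(det_name, num_labels=2)

# 2) Generative "issue extractor": one of the open models from the post,
#    with LoRA adapters attached for parameter-efficient fine-tuning.
gen_name = "Qwen/Qwen3-14B"
gen_tok = AutoTokenizer.from_pretrained(gen_name)
base = AutoModelForCausalLM.from_pretrained(gen_name, torch_dtype=torch.bfloat16)
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # rank/alpha/dropout are illustrative, not reported values
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
extractor = get_peft_model(base, lora_cfg)  # this is what gets trained on the 60k/10k split


def extract_issue(conversation: str, max_new_tokens: int = 128) -> str | None:
    """Run the cheap detector first; only invoke the generative extractor on positives."""
    enc = det_tok(conversation, return_tensors="pt", truncation=True)
    with torch.no_grad():
        has_issue = detector(**enc).logits.argmax(dim=-1).item() == 1
    if not has_issue:
        return None  # no issue detected, skip the generative call entirely

    prompt = f"Extract the customer's issue from this conversation:\n{conversation}\nIssue:"
    ids = gen_tok(prompt, return_tensors="pt").to(base.device)
    with torch.no_grad():
        out = extractor.generate(**ids, max_new_tokens=max_new_tokens)
    return gen_tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
```

The design choice the article highlights is visible in the gating: the classifier handles the high-volume yes/no decision cheaply, so the more expensive generative pass only runs when there is actually an issue to extract.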