🤖 AI Summary
A major automaker's analytics team rebuilt a years-long warranty-claim classification pipeline, replacing slow, expensive supervised models and thousands of brittle SQL rules with a prompt-driven LLM workflow. After building a 9-stage preprocessing pipeline, hand-labeling hundreds to thousands of examples per symptom, and validating TF-IDF+XGBoost as a strong baseline on heavily imbalanced data, the team benchmarked six frontier LLMs, using PR AUC as the primary metric alongside MCC and F1. Early LLMs (GPT-3.5) were unusable, but modern models proved faster, cheaper, and surprisingly capable. Nova Lite (~$0.06 per 1M tokens, PR AUC ~0.716) was chosen as the best value, and after six rounds of iterative prompt tuning, which used LLM-generated reasoning traces and a larger model to refine the prompts, it matched or beat XGBoost on 4 of 5 symptom categories, including a 35-point gain on "cut-chip".
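For illustration, here is a minimal sketch of that kind of TF-IDF+XGBoost baseline evaluated with the three metrics named above, assuming scikit-learn and xgboost are installed. The claim texts, labels, and hyperparameters are toy stand-ins, not the team's actual data or configuration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score, matthews_corrcoef, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Toy stand-in for hand-labeled claim narratives (the real pipeline used
# hundreds to thousands of labeled examples per symptom).
texts = [
    "engine stalls at idle", "paint chip on left door", "rattle from dashboard",
    "chip in windshield edge", "brake squeal when cold", "small paint chip near trim",
    "ac blows warm air", "cut and chip damage on fender", "door seal leaks in rain",
    "chip on hood from gravel", "loose trim panel", "battery drains overnight",
]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0])  # 1 = symptom present

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)

# scale_pos_weight rebalances skewed classes; eval_metric="aucpr" trains
# against the same PR AUC criterion used as the primary benchmark metric.
clf = XGBClassifier(
    n_estimators=200,
    scale_pos_weight=float((y_train == 0).sum()) / max((y_train == 1).sum(), 1),
    eval_metric="aucpr",
)
clf.fit(Xtr, y_train)

proba = clf.predict_proba(Xte)[:, 1]
pred = (proba >= 0.5).astype(int)
print("PR AUC:", average_precision_score(y_test, proba))  # primary metric on skewed data
print("MCC:   ", matthews_corrcoef(y_test, pred))
print("F1:    ", f1_score(y_test, pred))
```

PR AUC is the sensible headline number here because, unlike accuracy or ROC AUC, it stays informative when positives are rare.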
The significance: classification development can shift from data-centric (months or years of annotation and pipeline engineering) to instruction-centric, dramatically reducing time to deploy for drifting taxonomies or low-label regimes. Key technical takeaways: benchmark with PR AUC on skewed data, use reasoning traces to guide prompt refinement, and weigh cost against performance (Nova Lite's price-performance made iterative prompting practical). Supervised models remain preferable with stable targets and millions of labels, but LLM prompting is a powerful alternative when labels are scarce or requirements change rapidly.
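A minimal sketch of what such an instruction-centric loop might look like, assuming a hypothetical `call_llm(prompt) -> str` helper; the prompt wording, symptom definition, and JSON schema are illustrative, not the team's actual prompt:

```python
import json

# Illustrative prompt: asks for a reasoning trace alongside the label, so
# misclassifications can be inspected and fed back into prompt refinement.
PROMPT_TEMPLATE = """You are classifying vehicle warranty claims.
Symptom: "cut-chip" (paint cut or chip damage).

Claim text:
{claim}

Think step by step, then answer as JSON on one line:
{{"reasoning": "<one short paragraph>", "label": "yes" or "no"}}"""

def classify(claim: str, call_llm) -> dict:
    """Classify one claim; call_llm is a hypothetical model-invocation helper."""
    raw = call_llm(PROMPT_TEMPLATE.format(claim=claim))
    return json.loads(raw)  # {"reasoning": ..., "label": ...}

# Refinement loop, per round of prompt tuning:
# 1. Run classify() over a labeled validation set.
# 2. Collect the "reasoning" traces from misclassified claims.
# 3. Give the traces and errors to a larger model and ask for prompt edits.
# 4. Re-benchmark with PR AUC; keep the edit only if the score improves.
```

Because the labeled set is used only for validation rather than training, each refinement round is a prompt edit plus a re-benchmark, which is what makes a cheap model like Nova Lite practical to iterate on.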