Training Qwen to answer briefly yet intelligently using feedback control (www.runrl.com)

🤖 AI Summary
Researchers tuned Qwen-2.5-3B to produce short but still relevant answers by turning the multi-objective training problem into a feedback-control task. Instead of hand-searching a fixed weight between brevity and relevance, they defined two reward components, negative length (reward = −len(response)) and an LLM judge score (gpt-5-nano, scaled ×1000), and set a target judge score of 3,000.

Early RL-only tuning produced degenerate behavior (truncation, single characters). After adding the judge and then replacing static weight search with a proportional-derivative (PD) feedback controller that adjusts the balance parameter α based on the error and its derivative, the system reached the 3,000 rating by step 22 and held it stably while response length continued falling (≈81 characters at step 75).

This demonstrates a practical, control-theoretic approach to constrained or multi-objective optimization in transformer training: treat metrics as measurable outputs, tune a control input (a loss weight) with P/PD logic, and maintain target constraints precisely. Technical takeaways: scale heterogeneous rewards to comparable magnitudes, use a judge model as a differentiable-ish signal, and apply PD control to avoid overshoot. Challenges remain (controller gain tuning), but the method offers a principled, interpretable alternative to manual hyperparameter search and is broadly applicable to balancing KL penalties, denoising-vs-controllability trade-offs, regularizers, and other competing objectives.
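To make the mechanism concrete, here is a minimal Python sketch of how such a PD-controlled reward mix could look. It is not the authors' code: the function names, gains, and the convention of blending the two rewards with a single weight α are assumptions based on the summary above (negative-length reward, judge score scaled ×1000, target of 3,000).

```python
# Hypothetical sketch of PD-controlled reward balancing; names and gains are
# assumptions, not the authors' actual implementation.

TARGET_JUDGE_SCORE = 3000.0  # target judge rating (judge score scaled x1000)


def combined_reward(response: str, judge_score: float, alpha: float) -> float:
    """Blend brevity and relevance rewards using the current weight alpha."""
    length_reward = -len(response)            # brevity term: reward = -len(response)
    relevance_reward = judge_score * 1000.0   # judge score scaled to a comparable magnitude
    return alpha * relevance_reward + (1.0 - alpha) * length_reward


class PDController:
    """Proportional-derivative controller that nudges alpha toward the target
    judge score after each training step."""

    def __init__(self, kp: float = 1e-4, kd: float = 5e-4, alpha_init: float = 0.5):
        self.kp = kp              # proportional gain (assumed value)
        self.kd = kd              # derivative gain (assumed value), damps overshoot
        self.alpha = alpha_init   # current weight on the relevance term
        self.prev_error = None

    def update(self, mean_scaled_judge_score: float) -> float:
        error = TARGET_JUDGE_SCORE - mean_scaled_judge_score
        d_error = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        # Below target -> increase alpha (weight relevance more); above target ->
        # decrease it so the length penalty can keep shortening responses.
        self.alpha += self.kp * error + self.kd * d_error
        self.alpha = max(0.0, min(1.0, self.alpha))
        return self.alpha


# Usage inside an RL training loop (sketch):
# controller = PDController()
# alpha = controller.update(batch_mean_judge_score * 1000.0)
# rewards = [combined_reward(r, s, alpha) for r, s in zip(responses, judge_scores)]
```

The design choice mirrored here is the one the summary highlights: the judge score is the measured output, α is the control input, and the derivative term is what keeps the system from overshooting the 3,000 target.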