🤖 AI Summary
Researchers tuned Qwen-2.5-3B to produce short but still relevant answers by recasting the multi-objective training problem as a feedback-control task. Instead of hand-searching for a fixed weight between brevity and relevance, they defined two reward components—negative length (reward = −len(response)) and an LLM judge score (from gpt-5-nano, scaled ×1000)—and set a target judge score of 3,000. Early tuning on the length reward alone produced degenerate behavior (truncated answers, single characters). After adding the judge reward and then replacing static weight search with a proportional-derivative (PD) feedback controller that adjusts the balance parameter α based on the error and its derivative, the system reached the 3,000 rating by step 22 and held it stably while response length kept falling (≈81 characters at step 75).
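A minimal sketch of the PD-controlled reward mixing described above. The gains, the clamping range for α, and the exact form of the combined reward are illustrative assumptions, not the authors' actual values; only the −len(response) reward, the ×1000 judge scaling, and the 3,000 target come from the article.

```python
# Sketch of a PD controller that steers the balance weight alpha so the
# judge score settles at the 3,000 target. Gains and clamping are assumed.

TARGET_JUDGE_SCORE = 3000.0   # target judge rating from the article
KP, KD = 1e-4, 5e-4           # assumed proportional / derivative gains


def pd_update_alpha(alpha, judge_score, prev_error):
    """Adjust the balance weight alpha from the judge-score error."""
    error = TARGET_JUDGE_SCORE - judge_score   # proportional term
    d_error = error - prev_error               # derivative term (per step)
    alpha = alpha + KP * error + KD * d_error
    alpha = min(max(alpha, 0.0), 1.0)          # keep the weight bounded
    return alpha, error


def combined_reward(response, judge_score_raw, alpha):
    """Mix brevity and relevance rewards; judge score is scaled x1000."""
    length_reward = -float(len(response))      # reward = -len(response)
    judge_reward = judge_score_raw * 1000.0    # scale to a comparable magnitude
    return alpha * judge_reward + (1.0 - alpha) * length_reward
```

In a training loop this would run once per RL step: compute the batch-mean judge score, call pd_update_alpha to nudge α, then use combined_reward as the scalar reward for the policy-gradient update.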
This demonstrates a practical, control-theoretic approach to constrained or multi-objective optimization in transformer training: treat metrics as measurable outputs, tune a control input (loss weight) with P/PD logic, and maintain target constraints precisely. Technical takeaways: scale heterogeneous rewards to comparable magnitudes, use a judge model as a differentiable-ish signal, and apply PD control to avoid overshoot. Challenges remain (controller gain tuning), but the method offers a principled, interpretable alternative to manual hyperparameter search and is broadly applicable to balancing KL penalties, denoising vs controllability, regularizers, and other competing objectives.
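The article notes the same pattern generalizes to other competing objectives, such as KL penalties. As a purely illustrative sketch (the KL target, gains, and update rule below are assumptions, not taken from the article), the identical PD loop can steer a KL-penalty coefficient toward a divergence budget:

```python
# Hedged sketch: the same PD pattern applied to a KL penalty coefficient,
# one of the competing objectives the article says the method generalizes to.
# TARGET_KL and the gains are illustrative assumptions.

TARGET_KL = 0.05
KP_KL, KD_KL = 0.5, 0.1


def pd_update_kl_coef(kl_coef, measured_kl, prev_kl_error):
    """Raise the KL penalty when divergence exceeds the target, lower it otherwise."""
    error = measured_kl - TARGET_KL
    d_error = error - prev_kl_error
    kl_coef = max(kl_coef + KP_KL * error + KD_KL * d_error, 0.0)
    return kl_coef, error
```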