🤖 AI Summary
Researchers introduce SSA, a compact 3B-parameter model fine-tuned with a GRPO objective to “read” k parallel candidate reasoning traces and emit a single final answer. Evaluated on a suite of math reasoning benchmarks (GSM8K, MATH, AIME‑24, AMC‑23, OlympiadBench), SSA-3B achieves 56.1% average accuracy—only 3.6 percentage points below the Pass@5 oracle (59.7%) and substantially above simple majority voting (49.7%) and a 7B process‑reward verifier baseline (Qwen‑PRM, 53.0%). Crucially, SSA is trained on under 5% of the data used by those larger verifiers, and the same SSA checkpoint can be plugged into frozen base LLMs up to 32B (e.g., Qwen → Llama‑3) without any re‑tuning, generalizing across base family, base size, and the number of samples k.
Technically, SSA reframes the "scaling test‑time compute" idea: rather than merely sampling many reasoning paths and voting, or relying on huge fine‑tuned backbones, a small learned aggregator reads the parallel candidates and closes much of the oracle gap. The approach delivers near-parity with much larger sequential RL fine‑tuned models while keeping a 10× smaller footprint and comparable test‑time token budgets. Implications: modular, data- and compute-efficient ensemble reasoning, easier deployment as a plug‑in verifier/aggregator, and promising transferability—though a residual oracle gap remains, suggesting room for further improvements in aggregation or candidate generation.
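To make the contrast concrete, here is a minimal sketch of the two aggregation strategies the summary compares: majority voting over k final answers versus an SSA-style learned aggregator that reads the full candidate traces. The prompt format and the `ssa_generate` callable are assumptions for illustration, standing in for the fine-tuned 3B aggregator; this is not the paper's actual implementation.

```python
from collections import Counter

def majority_vote(answers):
    """Baseline: pick the most frequent final answer among k candidates."""
    return Counter(answers).most_common(1)[0][0]

def aggregate_with_ssa(candidates, ssa_generate):
    """Hypothetical SSA-style aggregation: pack the k candidate reasoning
    traces into a single prompt and let a small aggregator model emit one
    final answer. `ssa_generate` is a stand-in for the fine-tuned 3B model."""
    prompt = "Read the candidate solutions and output the final answer.\n\n"
    for i, (trace, answer) in enumerate(candidates, 1):
        prompt += f"Candidate {i}:\n{trace}\nProposed answer: {answer}\n\n"
    return ssa_generate(prompt)

# Toy demo: three of five candidates agree on a wrong answer, so majority
# voting fails, while an aggregator that reads the traces can (in principle)
# side with the correct minority.
candidates = [
    ("2 + 2 * 3 = 12 (added before multiplying)", "12"),
    ("2 + 2 * 3 = 12 (added before multiplying)", "12"),
    ("2 + 2 * 3 = 12 (added before multiplying)", "12"),
    ("2 + (2 * 3) = 8", "8"),
    ("2 + (2 * 3) = 8", "8"),
]

print(majority_vote([a for _, a in candidates]))           # "12" (wrong)
# Stub aggregator that trusts the trace with correct order of operations:
print(aggregate_with_ssa(candidates, lambda prompt: "8"))  # "8" (right)
```

The point of the design is that the aggregator conditions on the reasoning itself, not just the answer distribution, which is why it can beat majority voting when correct traces are in the minority.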