Adaptive multirate DSP wrappers around GPT (github.com)

🤖 AI Summary
Researchers released an adaptive "DSP-augmented" transformer wrapper that applies multirate signal-processing ideas to transformer hidden states and treats traditional DSP knobs as learnable meta-parameters. On small character-level GPTs (enwik8, text8) the approach yields large gains: validation loss improvements of 19.1% (enwik8) and 12.2% (text8), convergence that's ~65–68% faster to equivalent loss targets, and highly significant multi-seed results. The authors provide a reproducible PyTorch implementation and emphasize this is an exploratory study on small models rather than a claim of SOTA or guaranteed scaling.

Technically, the modules insert three DSP blocks around standard transformer sublayers: multirate analysis–synthesis filterbanks (low-pass decimation/upsampling with detail residuals mixed back via a learnable detail_strength α and mix_ratio m), low-frequency-oscillator (LFO) routing that builds time-varying gating signals from learned sinusoids per channel group, and a per-token channel bottleneck (bottleneck_ratio β) for regularization.

Crucially, these DSP hyperparameters (mix ratios, downsample factors, LFO frequencies/phases/amplitudes, gate temperature, bottleneck size) are optimized by a meta-gradient outer loop (meta_lr ~1e-4, updates every N steps) using EMAs of training/validation loss. Ablations show the multirate decomposition is the single largest contributor, with adaptive hyperparameter learning providing additional improvement. The work highlights a promising frequency-aware inductive bias and a practical meta-learning recipe, while noting uncertainty about behavior at larger scales or other domains.
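A minimal sketch of what the multirate analysis–synthesis block could look like, assuming average-pooling as the low-pass decimator and linear interpolation as the upsampler (the repo's actual filterbank design may differ; the parameter names `detail_strength` and `mix_ratio` are taken from the summary, the rest is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultirateBlock(nn.Module):
    """Hypothetical multirate block: low-pass decimate the hidden states,
    upsample back, and re-mix the high-frequency detail residual with
    learnable strength, then blend with the untouched path."""
    def __init__(self, d_model: int, downsample: int = 2):
        super().__init__()
        self.downsample = downsample
        # learnable DSP knobs named after the summary's alpha / m
        self.detail_strength = nn.Parameter(torch.tensor(0.5))
        self.mix_ratio = nn.Parameter(torch.tensor(0.0))

    def forward(self, x):  # x: (batch, seq, d_model)
        b, t, d = x.shape
        xt = x.transpose(1, 2)  # (b, d, t) for 1-D resampling
        # low-pass + decimate (avg_pool1d stands in for a proper anti-alias filter)
        coarse = F.avg_pool1d(xt, self.downsample, self.downsample)
        up = F.interpolate(coarse, size=t, mode="linear", align_corners=False)
        detail = xt - up                        # high-frequency residual
        y = up + self.detail_strength * detail  # mix detail band back in
        m = torch.sigmoid(self.mix_ratio)       # keep the blend in (0, 1)
        return m * y.transpose(1, 2) + (1 - m) * x
```

Keeping the blend ratio behind a sigmoid means the meta-optimizer can push it freely without ever producing an invalid mixing weight.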
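The LFO routing could be sketched as below, assuming one learned sinusoid (frequency, phase, amplitude) per channel group, squashed through a tempered sigmoid to form a time-varying gate. All names here (`LFOGate`, `freq`, `phase`, `amp`) are illustrative, not the repo's API:

```python
import torch
import torch.nn as nn

class LFOGate(nn.Module):
    """Hypothetical LFO routing: learned per-group sinusoids over the
    sequence axis produce time-varying gates applied to hidden channels."""
    def __init__(self, d_model: int, n_groups: int = 4):
        super().__init__()
        assert d_model % n_groups == 0
        self.group_size = d_model // n_groups
        # per-group learnable frequency, phase, amplitude (names assumed)
        self.freq = nn.Parameter(torch.rand(n_groups) * 0.1)
        self.phase = nn.Parameter(torch.zeros(n_groups))
        self.amp = nn.Parameter(torch.full((n_groups,), 0.5))
        self.temperature = nn.Parameter(torch.tensor(1.0))  # gate temperature

    def forward(self, x):  # x: (batch, seq, d_model)
        b, t, d = x.shape
        pos = torch.arange(t, device=x.device, dtype=x.dtype)       # (t,)
        osc = self.amp * torch.sin(pos[:, None] * self.freq + self.phase)  # (t, groups)
        gate = torch.sigmoid(osc / self.temperature.clamp_min(1e-3))
        gate = gate.repeat_interleave(self.group_size, dim=-1)      # (t, d)
        return x * gate[None, :, :]
```

Because the gate depends only on position, it can be precomputed once per sequence length and broadcast over the batch.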