🤖 AI Summary
Recent research has highlighted the critical role of learned memory tokens in Universal Transformers (UT) equipped with Adaptive Computation Time (ACT) when solving complex Sudoku-Extreme puzzles. Experiments showed that without memory tokens, UT configurations consistently failed to achieve non-trivial performance; roughly eight tokens proved optimal for reliable success, while larger counts yielded diminishing returns and noticeable attention dilution.
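A common way to realize memory tokens (the specifics here are an illustrative assumption, not the study's actual implementation) is to prepend a small set of learned vectors to the token sequence at each weight-tied recurrence step, let attention read and write them, then split them back off. A minimal NumPy sketch, with 81 tokens standing in for Sudoku cells and 8 memory tokens matching the reported sweet spot:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, Wq, Wk, Wv):
    # Minimal single-head self-attention over a (tokens, d) matrix.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

d, n_cells, n_mem = 16, 81, 8                      # 81 Sudoku cells; 8 memory tokens

memory = rng.normal(scale=0.02, size=(n_mem, d))   # learned parameters in a real model
Wq, Wk, Wv = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]
x = rng.normal(size=(n_cells, d))                  # puzzle-cell embeddings

for _ in range(4):                                 # weight-tied UT recurrence
    h = np.concatenate([memory, x], axis=0)        # prepend memory: (n_mem + n_cells, d)
    h = h + self_attention(h, Wq, Wk, Wv)          # residual attention update
    memory, x = h[:n_mem], h[n_mem:]               # split scratchpad from cells
```

Because the same weights are reused at every step, the memory rows act as a persistent scratchpad that the cell tokens can attend to across iterations.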
Moreover, the study identified a common initialization pitfall, rooted in inadequate bias settings, that caused more than 70% of training attempts to fail. Initializing the bias to a negative value markedly improved training outcomes, pointing to inherent challenges in ACT's initialization. The findings also underline ACT's superior consistency over fixed-depth processing, achieving better accuracy with fewer computational steps. These insights emphasize the importance of memory in deep learning architectures and can inform future research on enhancing transformers.
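In standard ACT formulations the bias in question is plausibly that of the halting unit, which emits a per-step halting probability and stops once the cumulative probability reaches 1 − ε. The toy helper below (my own illustration, assuming a constant per-step halting probability rather than anything from the study) shows why a zero bias is a trap: it halts after about two steps, starving the model of computation, while a negative bias leaves room to ponder.

```python
import math

def ponder_steps(bias, eps=0.01, max_steps=32):
    """Steps taken before cumulative halting probability reaches 1 - eps,
    assuming each step halts with constant probability sigmoid(bias)."""
    p = 1.0 / (1.0 + math.exp(-bias))
    total, n = 0.0, 0
    while total < 1.0 - eps and n < max_steps:
        total += p
        n += 1
    return n

print(ponder_steps(0.0))    # zero bias: halts almost immediately
print(ponder_steps(-3.0))   # negative bias: many more refinement steps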