ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention (arxiv.org)

0 points | 1 hour ago

🤖 AI Summary

Researchers have introduced ThriftAttention, an approach that reduces the computational cost of attention for long-context workloads. Existing methods use block-scaled quantization on Blackwell GPUs to run attention at 4-bit precision, but this often degrades quality in long-context scenarios. ThriftAttention counters this by selecting the most important query-key interactions to compute in FP16 while processing the less critical ones in FP4. This two-stage method achieves nearly the same quality as FP16 with the efficiency of FP4. By computing only 5% of attention blocks at the higher precision, it recovers 89.1% of the gap between FP4 and FP16 quality, substantially reducing the degradation typically seen on longer sequences. The summary also notes that the benefits scale as sequence lengths increase. The code is publicly available, so the technique can be tried in natural language processing and other applications that rely on attention.
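To make the idea concrete, here is a minimal pure-Python sketch of selective mixed-precision attention. It is not the authors' implementation. Everything in it is an assumption for illustration: the E2M1 value grid used to simulate FP4, one scale per row as the "block scale", the hypothetical `select_blocks` heuristic (ranking block pairs by the dot product of mean-pooled queries and keys), and the reading of "gap recovered" as `(mixed - fp4) / (fp16 - fp4)`. Only the query-key scores are quantized here, `V` stays in full precision, and there is no causal mask.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (assumed FP4 format).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_fp4(row):
    """Simulate block-scaled FP4: scale the row so max |x| maps to 6, then round."""
    amax = max(abs(x) for x in row)
    if amax == 0.0:
        return list(row)
    scale = amax / FP4_GRID[-1]
    out = []
    for x in row:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def select_blocks(Q, K, block, fraction):
    """Hypothetical selector: keep the top `fraction` of (query-block, key-block)
    pairs, ranked by the dot product of their mean-pooled vectors."""
    nb = len(Q) // block
    qm = [mean_vec(Q[i * block:(i + 1) * block]) for i in range(nb)]
    km = [mean_vec(K[j * block:(j + 1) * block]) for j in range(nb)]
    pairs = sorted(((dot(qm[i], km[j]), i, j)
                    for i in range(nb) for j in range(nb)), reverse=True)
    keep = math.ceil(fraction * len(pairs))
    return {(i, j) for _, i, j in pairs[:keep]}

def mixed_attention(Q, K, V, block=4, fraction=0.05):
    """Softmax attention where selected blocks use full-precision Q and K
    and all other blocks use the simulated FP4 copies."""
    n, d = len(Q), len(Q[0])
    hi = select_blocks(Q, K, block, fraction)
    Q4 = [fake_fp4(r) for r in Q]
    K4 = [fake_fp4(r) for r in K]
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            if (i // block, j // block) in hi:
                s = dot(Q[i], K[j])      # high-precision block
            else:
                s = dot(Q4[i], K4[j])    # simulated FP4 block
            scores.append(s / math.sqrt(d))
        m = max(scores)                  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(n)) / z for c in range(d)])
    return out

def gap_recovered(score_fp4, score_mixed, score_fp16):
    """Assumed metric: share of the FP4-to-FP16 quality gap closed by mixing."""
    return (score_mixed - score_fp4) / (score_fp16 - score_fp4)

if __name__ == "__main__":
    random.seed(0)
    n, d = 16, 8
    Q, K, V = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
               for _ in range(3))
    out = mixed_attention(Q, K, V, block=4, fraction=0.05)
    print(len(out), len(out[0]))
```

Setting `fraction=0.0` gives the all-FP4 baseline and `fraction=1.0` gives full-precision scores, so the same function can be used to compare the three regimes on toy inputs. A real kernel would make the selection and the two compute paths far cheaper than this quadratic loop.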
