Story of Two GPUs: Characterizing the Resilience of Hopper H100 and Ampere A100 (arxiv.org)

0 points 56 days ago | visit original

🤖 AI Summary

A recent study characterizes the resilience of NVIDIA's Hopper H100 and Ampere A100 GPUs in Delta, a large-scale AI system with 1,056 GPUs and over 1,300 petaflops of peak throughput. Drawing on 2.5 years of operational data, it finds that H100 GPUs are more resilient than A100 GPUs in certain hardware aspects but markedly less resilient in memory: the H100's mean time between errors (MTBE) for memory errors is 3.2 times lower than the A100's. The study also finds that the H100's memory error-recovery mechanisms are insufficient for its increased memory capacity. For the AI/ML community, the results underscore the need for robust error recovery at both the hardware and application levels as reliance on large AI systems grows. The study further indicates that overprovisioning of at least 5% may be required to absorb GPU failures, with consequences for resource management and operating costs in large-scale deployments. As AI systems grow more complex, these findings could inform the design of future GPUs and their integration into large ML workflows.
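
To make the two headline quantities concrete, here is a minimal Python sketch. It assumes the common definition of MTBE as observed GPU hours divided by the number of errors, which may differ in detail from the paper's methodology. The function names and all input numbers are hypothetical and chosen only for illustration; they are not data from the study.

```python
import math


def mtbe_hours(gpu_hours: float, error_count: int) -> float:
    """Mean time between errors, in hours (assumed definition:
    total observed GPU hours divided by the number of errors)."""
    if error_count <= 0:
        raise ValueError("error_count must be positive")
    return gpu_hours / error_count


def gpus_to_provision(required_gpus: int, overprovision_percent: int) -> int:
    """GPUs to deploy so that `required_gpus` stay usable, given a
    percentage of spare capacity. Rounds up to a whole GPU."""
    return math.ceil(required_gpus * (100 + overprovision_percent) / 100)


# Hypothetical inputs: the same observation window for two GPU types,
# with the second type logging 3.2 times as many memory errors.
mtbe_a = mtbe_hours(gpu_hours=3200.0, error_count=10)
mtbe_b = mtbe_hours(gpu_hours=3200.0, error_count=32)
ratio = mtbe_a / mtbe_b  # how many times lower the second MTBE is

# Hypothetical job that needs 1,000 usable GPUs with 5% spare capacity.
deployed = gpus_to_provision(required_gpus=1000, overprovision_percent=5)
```

Under this definition, a 3.2 times lower MTBE means errors arrive 3.2 times as often for the same number of GPU hours, and a 5% margin on a 1,000-GPU job means deploying 50 extra GPUs.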
