Sarcouncil Journal of Engineering and Computer Sciences
An open-access, peer-reviewed international journal
Publication Frequency: Monthly
Publisher Name: SARC Publisher
ISSN Online: 2945-3585
Country of Origin: Philippines
Impact Factor: 3.7
Language: English
Keywords
- Engineering and technology fields, including Civil Engineering, Construction Engineering, Structural Engineering, Electrical Engineering, Mechanical Engineering, Computer Engineering, Software Engineering, Electromechanical Engineering, Telecommunication Engineering, Communication Engineering, and Chemical Engineering
Editors

Dr Hazim Abdul-Rahman
Associate Editor
Sarcouncil Journal of Applied Sciences

Entessar Al Jbawi
Associate Editor
Sarcouncil Journal of Multidisciplinary

Rishabh Rajesh Shanbhag
Associate Editor
Sarcouncil Journal of Engineering and Computer Sciences

Dr Md. Rezowan ur Rahman
Associate Editor
Sarcouncil Journal of Biomedical Sciences

Dr Ifeoma Christy
Associate Editor
Sarcouncil Journal of Entrepreneurship And Business Management
GPU Reliability in AI Clusters: A Study of Failure Modes and Effects
Keywords: GPU Reliability, Thermal Failure Modes, AI Infrastructure Resilience, Predictive Maintenance, Memory Subsystem Degradation.
Abstract: This article presents a comprehensive examination of GPU reliability challenges in artificial intelligence clusters, addressing a critical gap in understanding how modern AI workloads affect accelerator hardware. It establishes a detailed taxonomy of GPU failure modes specific to AI workloads, with particular attention to thermal issues, power delivery instabilities, memory subsystem degradation, and manufacturing variations. It shows that the sustained high utilization characteristic of deep learning training creates unique stress patterns that accelerate hardware degradation through mechanisms distinct from those observed in traditional computing workloads. The article quantifies the cascading impacts of these failures on training convergence, model accuracy, system performance, and operational economics. To address these challenges, it develops and evaluates a suite of mitigation strategies spanning proactive monitoring techniques, predictive maintenance frameworks, fault-tolerant architectural designs, and software resilience mechanisms. Case studies across large-scale training clusters, edge deployments, and cloud environments provide contextual insights into how reliability varies across deployment modalities. The article offers both theoretical frameworks for understanding GPU reliability in AI contexts and practical recommendations for infrastructure operators seeking to improve system resilience without compromising computational performance. As AI hardware continues its rapid evolution toward higher power densities and greater architectural complexity, the reliability engineering approaches established here provide essential guidance for the sustainable scaling of AI infrastructure.
Author
- Sameeksha Gupta
- Meta Platforms Inc., USA