Computer Science ETDs
Publication Date
5-1-2016
Abstract
High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable executing more than an exaflop, 10^18 floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory.
Language
English
Keywords
High-performance computing, Fault tolerance, Simulation, Modeling
Document Type
Dissertation
Degree Name
Computer Science
Level of Degree
Doctoral
Department Name
Department of Computer Science
First Committee Member (Chair)
Ferreira, Kurt B.
Second Committee Member
Arnold, Dorian
Third Committee Member
Lowenthal, David
Recommended Citation
Levy, Scott. "Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems." (2016). https://digitalrepository.unm.edu/cs_etds/31