Continuous Online Memory Diagnostics
Musfiq Niaz Rahman
Wednesday March 21, 2012
11:00 am - Sennott Square 6106 - Eli Lilly Room
Abstract
Today's computers have gigabytes of main memory due to improved DRAM density. As density increases, smaller bit cells become more susceptible to errors. These errors can lead to application and system corruption, reducing reliability and increasing downtime. A study of thousands of computers in Google's data center revealed that the incidence of DRAM errors is surprisingly high: one in three machines had at least one error per year. Another study by Microsoft shows strong evidence of system crashes due to DRAM errors in consumer PCs. As error susceptibility increases, so does the need for memory resiliency. The high error rate indicates that new resiliency techniques will be necessary to handle errors in the even larger main memories of future systems. In my research, I propose to develop new schemes that improve memory resiliency through online diagnostics. My goal is to show that a transparent, online, software-based strategy for diagnostic memory testing is achievable by using over-provisioned system resources.
Developing a memory diagnostic is challenging because it must be transparent and scalable while incurring low performance and power overheads. In my work, I will design an approach, called Continuous Online Memory Testing (COMeT), that tests memory health concurrently with application execution. COMeT is a software-based approach that works in an online setting and runs alongside other applications. As the first step of my research, I will show the feasibility of COMeT for single-threaded applications in small-scale systems with limited memory capacity. In the next step, COMeT will be extended to support multi-threaded applications in systems with more memory capacity and more processing cores. In the last step, I will make the memory diagnostic performance- and power-aware. COMeT will aim to tune itself dynamically to system load at runtime in order to meet administrator-specified performance and power budgets for the diagnostic.
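As a rough illustration only (this is not COMeT's implementation, which the abstract does not describe in detail), the idea of testing memory in small chunks while staying within a time budget might be sketched as follows. The pattern set, the function names `check_region` and `run_diagnostic`, the `budget` parameter, and the fault-injecting `StuckBitMemory` class are all assumptions made for this sketch. A Python `bytearray` stands in for physical memory, so the code shows only the test-and-throttle logic, not real DRAM testing.

```python
import time

# Assumed test patterns (all zeros, all ones, alternating bits).
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def check_region(mem, start, length, patterns=PATTERNS):
    """Write each pattern to mem[start:start+length], read it back, and
    return the sorted offsets of bytes that failed to hold a pattern.
    The original contents are saved first and restored afterwards."""
    saved = bytes(mem[start:start + length])
    bad = set()
    for pattern in patterns:
        mem[start:start + length] = bytes([pattern]) * length
        for i in range(start, start + length):
            if mem[i] != pattern:
                bad.add(i)
    mem[start:start + length] = saved
    return sorted(bad)


def run_diagnostic(mem, chunk=4096, budget=0.05,
                   clock=time.perf_counter, sleep=time.sleep):
    """Test mem one chunk at a time. After each chunk, sleep long enough
    that testing consumes roughly `budget` (0 < budget <= 1) of the
    elapsed wall-clock time. Returns all faulty offsets found."""
    faults = []
    for start in range(0, len(mem), chunk):
        length = min(chunk, len(mem) - start)
        t0 = clock()
        faults.extend(check_region(mem, start, length))
        busy = clock() - t0
        # busy / (busy + idle) == budget  =>  idle = busy * (1 - budget) / budget
        sleep(busy * (1.0 - budget) / budget)
    return faults


class StuckBitMemory(bytearray):
    """Hypothetical fault model for the sketch: the lowest bit of the byte
    at `bad_index` always reads as 1."""
    bad_index = -1

    def __getitem__(self, key):
        value = super().__getitem__(key)
        if key == self.bad_index:  # slices never compare equal to an int
            return value | 0x01
        return value


if __name__ == "__main__":
    memory = StuckBitMemory(1024)
    memory.bad_index = 300
    print(run_diagnostic(memory, chunk=256, budget=0.5))
```

In this sketch the budget is a fixed fraction of wall-clock time; a load-aware diagnostic of the kind proposed above would presumably adjust such a parameter at runtime rather than keep it constant.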
Throughout this research, I will design, develop, and apply different self-testing strategies to a variety of applications in both small and large systems. My design will serve as a guide to how a software-only online diagnostic can be structured and integrated with an OS. I will develop new algorithms that show how to integrate different test parameters for memory diagnostics. My design will also reveal how self-testing can coexist with OS memory management without disrupting its functionality. I will evaluate the performance, energy, and resiliency improvements of COMeT, including an analysis of important design and configuration choices. In summary, my research will demonstrate the feasibility of a software-based online memory diagnostic and will guide OS and application developers toward making their systems more resilient to memory errors.
Dissertation Advisers
Dr. Bruce R. Childers and Dr. Sangyuen Cho, Department of Computer Science
Committee Members
Dr. Rami Melhem, Department of Computer Science, University of Pittsburgh
Dr. Kartik Mohanram, Department of Electrical and Computer Engineering, University of Pittsburgh