Dissertation Abstract

Flexible Fault-Tolerance Using Redundancy
in Mesh Connected Processor Arrays

John C. Ramirez, Ph.D.
University of Pittsburgh, 1995

ABSTRACT

Massive parallelism in computing systems allows for continued increases in computing performance even as the limit of maximum processor speed is neared. An architecture well-suited to massive parallelism is the mesh connected processor array (MCPA). MCPAs are simple in design, scalable, and can be implemented in VLSI chips and wafers. With an increase in parallelism, however, comes an increase in the probability of processors within a system becoming faulty. In critical situations, massively parallel systems must be able to function despite faulty processors, and they must be able to recover quickly from any faults that occur. Triple-modular redundancy (TMR) is an effective way of masking faults in computing systems without delay, and therefore is an appropriate means of fault-tolerance in systems when time is critical.

In this thesis, fault-tolerance in MCPAs is examined utilizing modular redundancy. First, a new mesh architecture, the Processor Switch/Voter Array (PSVA) is proposed. This architecture is flexible, reconfigurable and allows for varying levels of redundancy. Several algorithms are proposed which map communication graphs onto PSVAs with TMR so as to avoid faulty processors. These algorithms are analyzed thoroughly using Markov chains and other probabalistic techniques. An algorithm which allows recovery from run-time faults is also proposed. This algorithm utilizes spare processors in the PSVA in order to maintain TMR throughout the system despite processor failures. This run-time algorithm is analyzed through simulations.

Fault-tolerance through redundancy is also examined in general MCPAs. When communication graphs are mapped onto general MCPAs with triple-modular redundancy, the communication lines of the MCPAs become overloaded due to the extra messages necessary for TMR. This overloading can slow communication within the MCPA and reduce the performance of the system. The overloading resulting from some mappings is examined through analysis and simulation. Three protocols are then proposed which reduce the messages required by one-third while still maintaining the fault-correction capabilities of TMR. The protocols are similar in nature, but differ based on their assumptions about the system being used. These protocols can be valuable in allowing MCPAs to perform at a higher level without sacrificing the fault-correction capabilities of TMR.