
| Fault-Tolerant Scheduling in Multiprocessors |
We are involved in the SCHEDULING of real-time tasks in DISTRIBUTED and MULTIPROCESSOR systems so that they can TOLERATE FAULTS. One way of providing fault-tolerance is to schedule multiple copies of a task on different processors. If the PRIMARY task cannot be completed due to a fault, the scheduled BACKUP is run and the task is completed. In this paper, we propose a new algorithm for fault-tolerant scheduling on a distributed system. The algorithm guarantees the completion of a scheduled task before its deadline in the presence of a processor failures. Our algorithm schedules several backup tasks on top of one another and dynamically deallocates the backups as soon as the original tasks complete executions, thus increasing the utilization of processors.
Simulation results show that our method achieves higher task schedulability than using a spare processor as a backup to be invoked in the event of a failure. Further, we show that the cost, in terms of schedulability, of guaranteeing fault tolerance for dynamic systems in a fault-free environment is extremely low.
For a description of the ALGORITHM and some SIMULATION results you can look at the paper that appeared in IPPS '94. For an improved version of the algorithm and some analysis, including Markov ANALYSIS, you can look at the FTCS '94 paper . A paper that combines these results was published in IEEE Transactions of Parallel and Distributed Systems, March 1997.
In single processors, the techniques for overloading and de-allocation can still be used, but a better scheme can be devised. In RTSS '95 we presented a paper with a scheme to guarantee that the execution of real-time tasks can tolerate transient and intermittent faults assuming any queue-based scheduling technique. The scheme is based on reserving sufficient slack in a schedule such that a task can be re-executed before its deadline without compromising guarantees given to other tasks. Only enough slack is reserved in the schedule to guarantee fault tolerance if at most one fault occurs within a time interval. This results in increased schedulability and a very low percentage of deadline misses even if no restriction is placed on the fault separation. We provide two algorithms to solve the problem of adding fault tolerance to a queue of real-time tasks. The first is a dynamic programming optimal solution and the second is a greedy heuristic which closely approximates the optimal.
In addition, we have also tackled the issue of fault tolerance in preemptive priority scheduling, such as RMS and EDF. We discuss conditions that must be met by any such recovery scheme, although we concentrate on re-execution in the event of faults. We also extend the original RMS, EDF, and the exact characterization of RMS to provide tolerance for single and multiple transient faults. We derive schedulability bounds for sets of real-time tasks given the desired level of fault tolerance for each task or subset of tasks. Finally, we analyze and compare those bounds with existing bounds for non-fault-tolerant and other variations of RMS. A version of the FT-RMS results appeared in DCCA '97.
Sunondo Ghosh's dissertation, which combines all these aspects, is here .
FAULT TOLERANCE need not be at the level of particular tasks. To exploit system-level fault-tolerant behavior of a real-time system, we have developed a scheme to create feasible schedules either online or offline, and to explore the different avenues according to how long before the fault is treated the systems can stand alone. That is, we allow the system to switch between schedules to implement GRACEFUL DEGRADATION, FAULT MASKING, LOAD SHEDDING, and other fault tolerance for real-time systems. More information can be obtained by reading this paper .