
Fault-tolerant Time-line scheduling in RT-Mach
The following work was developed at the University of Pittsburgh.
This release includes the design and implementation of a non-preemptive real-time scheduling policy that recovers from transient faults by re-executing tasks when an error is detected. The policy can be used to schedule static or dynamic tasks. Non-preemptive fault-tolerant scheduling is based on building a time-line (i.e., an explicit reservation of CPU time for thread execution) and providing enough slack on this time-line for each task to re-execute. An error signaling mechanism causes the RT-Mach microkernel to re-execute a task when an error is detected.
The source tree for 97a with FTTL is available here.
The first step is to select the new fault-tolerant (FT) timeline (TL) scheduling policy with the RT-Mach library call rt_set_scheduling_policy(SCHED_POLICY_FT_TIMELINE).
This causes the scheduler to schedule a backup for every fault-tolerant thread that is scheduled in the system. Backups are overlapped so that, if at most one thread is re-executed in every fault_interval, the scheduler guarantees that each thread (and its re-execution) will complete within its deadline. The fault_interval is specified by the fttl_set_fault_interval() system call. It is typically chosen as a function of the MTTF (mean time to failure) of the transient faults in the system. Non-fault-tolerant threads may also be scheduled under the FT_TIMELINE scheduling policy. The attribute isFT, specified through fttl_thread_attribute_init(), determines whether a thread is fault-tolerant. Backups are not scheduled for non-fault-tolerant threads: for such a thread, the scheduler allocates only enough processor time to execute once before its deadline.
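The effect of overlapping backups can be illustrated with a small feasibility check. This is only a sketch under simplifying assumptions, not the RT-Mach scheduler: it assumes at most one fault over the whole timeline (that is, a fault_interval at least as long as the timeline), tasks that run back to back in array order, and a hypothetical sketch_task record. None of these names come from the FTTL interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record for this sketch; not an RT-Mach structure. */
typedef struct {
    long ready;              /* earliest allowed start time */
    long wcet;               /* worst-case execution time */
    long deadline;           /* latest allowed completion time */
    bool is_fault_tolerant;  /* true if a re-execution must be covered */
} sketch_task;

/* Run the tasks non-preemptively in array order. Task `faulty` executes
 * twice; pass faulty == n for the fault-free case. Returns true if every
 * task finishes by its deadline. */
static bool meets_deadlines(const sketch_task *t, size_t n, size_t faulty)
{
    long now = 0;
    for (size_t i = 0; i < n; i++) {
        long start = now > t[i].ready ? now : t[i].ready;
        long runs = (i == faulty) ? 2 : 1;
        now = start + runs * t[i].wcet;
        if (now > t[i].deadline)
            return false;
    }
    return true;
}

/* True if the timeline survives the fault-free case and a single
 * re-execution of any one fault-tolerant task. */
bool timeline_tolerates_one_fault(const sketch_task *t, size_t n)
{
    assert(t != NULL || n == 0);
    if (!meets_deadlines(t, n, n))
        return false;
    for (size_t f = 0; f < n; f++) {
        if (t[f].is_fault_tolerant && !meets_deadlines(t, n, f))
            return false;
    }
    return true;
}
```

For two fault-tolerant tasks with execution times 2 and 3 and a common deadline of 8, this check succeeds, although reserving a separate backup slot for each task would need 2 + 2 + 3 + 3 = 10 time units. Sharing the slack is what the single-fault assumption buys.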
Threads are created with fttl_thread_create(), after initializing the attributes of an FTTL thread with fttl_thread_attribute_init(). This latter call initializes the thread attributes, including isFT, which specifies a fault-tolerant thread; isPersistent, which specifies a persistent thread; the thread's ready time (earliest allowed start time); the thread's execution time (worst-case execution time required by the thread); and the thread's deadline (latest possible completion time of the thread). A non-persistent thread will be scheduled between the ready time and the deadline specified in this call. If the call returns KERN_SUCCESS, the system guarantees that the thread will be executed (and, if it is fault-tolerant, possibly re-executed) within the thread deadline.
Another feature of timeline threads is persistence. If the isPersistent attribute is TRUE, the thread created by fttl_thread_create() is not submitted to the scheduler immediately. Instead, a persistent thread is created, and the ready, execution and deadline times specified in fttl_thread_attribute_init() are ignored. A subsequent call to fttl_activate_persist_thread() is required to specify these times and actually schedule an instance of the persistent thread. Multiple instances of a persistent thread may be scheduled on the timeline. An instance of a persistent thread may be removed from the timeline with a call to fttl_deactivate_persist_thread(). This call requires an instance number. An instance of a persistent thread can find its own instance number by calling fttl_instance_self(). A persistent thread may be fault-tolerant or non-fault-tolerant.
Upon detecting an error in a fault-tolerant thread, the user can cause the thread to be re-executed by invoking the system call fttl_set_fault_flag(flag_value) from the user thread. This system call sets a system "fault flag" to flag_value. During thread termination, the system checks whether the "fault flag" is set; if it is, the system re-executes the thread. Otherwise, it simply terminates the thread. Conversely, the user can set flag_value to FALSE to deliberately avoid triggering re-execution of the thread, even if an error was detected.
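The flag-then-check-at-termination protocol can be mimicked in user space as follows. This is a simulation under stated assumptions, not kernel code: sketch_set_fault_flag() merely stands in for fttl_set_fault_flag(), the flag is an ordinary variable, and the runner allows a single re-execution on the assumption that one backup is reserved per fault-tolerant thread. The sketch_job type and its body are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the system "fault flag" (an assumption of this sketch). */
static bool sketch_fault_flag = false;

/* Stand-in for fttl_set_fault_flag(flag_value). */
void sketch_set_fault_flag(bool flag_value)
{
    sketch_fault_flag = flag_value;
}

typedef void (*sketch_body)(void *arg);

/* Execute the body once; at "termination" check the flag and, if it is
 * set, clear it and execute the body one more time. Returns the number
 * of executions performed (1 or 2). */
int sketch_run_with_reexecution(sketch_body body, void *arg)
{
    assert(body != NULL);
    sketch_fault_flag = false;
    body(arg);
    if (!sketch_fault_flag)
        return 1;
    sketch_fault_flag = false;
    body(arg);
    return 2;
}

/* Hypothetical job that detects an error on its first `fail_first` runs. */
typedef struct {
    int attempts;
    int fail_first;
} sketch_job;

void sketch_job_body(void *arg)
{
    sketch_job *job = arg;
    job->attempts++;
    bool error_detected = job->attempts <= job->fail_first;
    sketch_set_fault_flag(error_detected);  /* request re-execution on error */
}
```

A job that never detects an error runs once, while a job that flags an error on its first run is executed a second time before the runner returns.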