Douglas Niehaus, Associate Professor, Electrical Engineering and Computer Science Department, is PI for collaborative research at the University of Kansas.
Bruce McMillin, Professor, Department of Computer Science, is PI and Mariesa Crow, Professor, Department of Electrical and Computer Engineering, is co-PI for collaborative research at the University of Missouri-Rolla.
Terry Tidwell, Christopher Gill, and Venkita Subramonian, Scheduling Induced Bounds and the Verification of Preemptive Real-Time Systems, Washington University Department of Computer Science and Engineering, Technical Report WUCSE-2007-34.
In particular, our previous work has focused on the time and event semantics of middleware mechanisms whose interactions are by design (1) decoupled through explicit interfaces, and (2) temporally predictable at the time scales of interest to real-time middleware design, both of which we are able to leverage in modeling and verifying their semantics. Despite such design for predictability in the middleware mechanisms themselves, many soft real-time systems of interest must also address additional temporal variability and event asynchrony introduced by the application and by the system's operating environment.
In this current research, we have made four main contributions toward this goal. First, we have developed formal models for the sharing of computational resources among different system activities. Second, we have defined a practical schedulability analysis for generalized sporadic tasks whose timing behavior is subject to multiple inter-arrival constraints. Third, we have developed reconfigurability extensions that allow real-time component middleware to accommodate and enforce the timing semantics of both periodic and aperiodic tasks. Fourth, we have designed, and implemented in MC-ORB, a new real-time middleware concurrency architecture tailored to real-time tasks atop modern multi-processor and multi-core platforms.
Because of the preemptive nature of resource management in such systems, finite timed automata (which we have applied in our previous research to model the time and event semantics of system software) alone are not able to represent the combined semantics. Instead, we have used infinite timed automata models (which we call time domain automata) to capture semantics involving alternative preemptive interleavings of execution intervals. An example of such an automaton for two preemptively interleaved threads of execution is shown in the following figure.
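To make the notion concrete, the following minimal sketch (in our own notation, which simplifies the full formalization in the technical report listed above) shows the two kinds of transitions such an automaton takes for two preemptively interleaved threads A and B, whose accumulated execution times a and b range over an unbounded domain:

```latex
% Minimal illustrative sketch (our notation, not verbatim from the models):
% states pair the running thread with accumulated execution times (a, b).
\[
  (\mathit{run}_A,\, a,\, b) \;\xrightarrow{\;\delta\;}\; (\mathit{run}_A,\, a + \delta,\, b)
  \qquad \text{(thread A executes for time } \delta \text{)}
\]
\[
  (\mathit{run}_A,\, a,\, b) \;\xrightarrow{\;\mathit{preempt}\;}\; (\mathit{run}_B,\, a,\, b)
  \qquad \text{(the scheduler preempts A in favor of B)}
\]
% Because a and b are unbounded, the automaton has infinitely many states,
% which is why finite timed automata alone cannot capture these semantics.
```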
Dwyer et al. have shown in their EMSOFT 2003 paper how repetition of "quasi-cyclic" structures in the state space can be exploited to reduce the complexity of model checking. As the figure above illustrates, we have extended their approach from periodic real-time systems with rate-monotonic scheduling to real-time systems whose timing and scheduling semantics are more variable, by identifying similar repeated structure within our time domain automata models.
These bounds are often functions not only of the variables in the (potentially quasi-cyclic) state space, but also of the temporal guards on transitions along different paths through the (infinite) time domain automaton. As quasi-cyclic structures are repeated in the state space, these bounds can be simplified by parameterizing them with an index for the current repetition of the structure in the state space. The following figure illustrates such index-parameterized bounds.
A key question is whether the bounds themselves (and thus the semantics of the executing threads) converge towards a steady state bound as time progresses, and if so whether that steady state falls within certain absolute limits (e.g., prescribed by application requirements). An example of such convergent bounds, induced by fairness constraints imposed by a scheduler, is shown in the following figure.
Note how convergence occurs in the figure above: as the index goes to infinity, the bounds converge to steady state based on constant parameters that characterize the scheduler's decision function.
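As a concrete (purely illustrative) instance of such convergence, suppose the bound on a thread's lag after the k-th repetition of the quasi-cyclic structure obeys a contraction, where the constants α and c stand in for the scheduler parameters mentioned above:

```latex
% Illustrative recurrence (our example, not taken directly from the models):
\[
  L_{k+1} \;=\; \alpha\, L_k \,+\, c, \qquad 0 < \alpha < 1,
\]
\[
  L_k \;\longrightarrow\; L^{*} \;=\; \frac{c}{1 - \alpha}
  \qquad \text{as } k \to \infty .
\]
% Verification then reduces to checking that the steady-state bound L^{*}
% falls within the absolute limits prescribed by application requirements.
```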
As the figure above illustrates, the example application consists of multiple processing stages for real-time image capture, transmission, and display, with sub-steps on the sending and receiving hosts to compress and decompress individual images to reduce transmission latency. To avoid idling CPU and network resources, and to provide fine-grained scheduling control points throughout the application, each processing stage is serviced by a pool of threads, with each thread dedicated to a particular processing pipeline that spans multiple stages end-to-end. That concurrency architecture is illustrated by the following figure.
We ran five concurrent pipelines in this example, under the scheduling constraint that the progress of each pipeline is kept within one frame of all other pipelines. We gave the pipelines differing ranges of variable execution times per stage (70 to 80 msec for pipelines 1 and 3, 80 to 90 msec for pipelines 2 and 4, and 100 to 110 msec for pipeline 5). The resulting quasi-cyclic structure can be described by a 5-dimensional hypercube similar to the ones shown in the following figure for the 2-dimensional and 3-dimensional cases.
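A minimal sketch of that scheduling constraint (the names and structure here are ours, not the experiment's actual code), assuming a simple completed-frame counter per pipeline, is:

```cpp
// Illustrative sketch: a scheduler predicate enforcing that no pipeline
// runs more than one frame ahead of the least-advanced pipeline.
#include <algorithm>
#include <array>

constexpr int kNumPipelines = 5;

// frames[i] is the number of frames pipeline i has completed so far.
bool may_advance(const std::array<int, kNumPipelines>& frames, int pipeline) {
    int slowest = *std::min_element(frames.begin(), frames.end());
    // Advancing by one frame keeps this pipeline within one frame of the
    // least-advanced pipeline only if it is currently at that minimum.
    return frames[pipeline] + 1 - slowest <= 1;
}
```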
The right-hand figure below shows the observed raw frame-by-frame progress of all of the pipelines. The left-hand figure below shows the same data only at the points where all five pipelines have made equivalent progress.
To validate our formal models and the bounds we derived from them, we compared the bounds predicted from our models to the actual execution of the system, in terms of frames processed in each pipeline. In addition to the absolute bounds predicted by our models, we used the models and parameters to calculate tighter bounds that encapsulate the empirically observed semantics for purposes of analysis. The purpose of the recalculated bounds is to give a measure of the pessimism of the specified bounds relative to the observed behavior. However, the looser bounds are the ones that must be used for verification, since they capture all possible executions absent any further specified constraints.
In the next set of figures, the right-hand column compares the observed raw frame-by-frame progress to the predicted (and empirically tightened) bounds. The left-hand column offers the same comparison only at the points where all five pipelines have made equivalent progress (i.e., the points where the scheduler has enforced a transition from one repetition of the quasi-cyclic structure to the next).
This schedulability analysis includes schedulability tests for both single-processor and end-to-end cases. For the end-to-end cases, a generalized release guard protocol must be used at run-time to enforce subtask separation, as the following figure illustrates.
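To sketch how such a release guard can be enforced (this is our own illustrative reconstruction; the ECRTS 2008 paper cited below gives the actual protocol), suppose each inter-arrival constraint (n, d) requires that a new release occur at least d time units after the release n instances earlier:

```cpp
// Illustrative sketch of a generalized release guard under multiple
// inter-arrival constraints (names and structure are ours).
#include <algorithm>
#include <vector>

struct InterArrivalConstraint { int n; double d; };

// 'releases' holds past release times in order; 'arrival' is when the
// triggering event (e.g., the upstream subtask's completion) reached this host.
double guarded_release(const std::vector<double>& releases, double arrival,
                       const std::vector<InterArrivalConstraint>& constraints) {
    double t = arrival;
    for (const auto& c : constraints) {
        if (releases.size() >= static_cast<size_t>(c.n)) {
            // The new release must come at least d after the release n steps back.
            t = std::max(t, releases[releases.size() - c.n] + c.d);
        }
    }
    return t;  // earliest time at which the subtask may be released
}
```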
To evaluate the generalized sporadic task (ST) model alongside the periodic task (PT) model which it generalizes, we ran simulations to compare them in the presence of release guards (RG).
The figures below illustrate the benefit of the generalized sporadic task model. As shown by the figure on the left, when one task (T3) is allowed 30% jitter in its execution time, the bounds on the tasks' worst-case response times under the generalized sporadic task model are no worse than (and in the case of the lowest-priority task T1, improve upon) those under the periodic task model. As the figure on the right illustrates, when that task's jitter increases to 60%, the sporadic task model affords significantly tighter bounds than the periodic task model for all tasks in our experiment.
The figure on the left below offers a similar comparison of the worst-case response times for the lowest-priority task (which is most significantly impacted by jitter in higher-priority tasks), as task T3's jitter percentage is steadily increased. The bounds remain consistent up to about 33% jitter, beyond which the response time for task T1 grows and then becomes unbounded under the periodic task model. The figure on the right below shows a similar effect on the deadline miss ratio, which likewise increases with the jitter percentage under the periodic task model but not under the generalized sporadic task model.
In addition to the smaller task set used for illustrative purposes in the experiments described above, we also evaluated a larger (ten-task) task set that is representative of more complex real-world applications. As the figure on the left below illustrates, similar results were seen when one task was allowed significant jitter (in this case 75%): the generalized sporadic task model was at least as good as, and in many cases much better than, the periodic task model, while for multiple tasks the periodic model could not bound response times at that level of jitter. The figure on the right below shows a similar increase in the deadline miss ratio for the periodic task model (but not the generalized sporadic task model) as that task's jitter increases, for this task set as well.
We also examined the effect of increasing the number of simultaneous releases of a task, the number of separate time constraints on a task, and the number of tasks with multiple time constraints. As the figures below indicate, the deadline miss ratio was consistently better under the generalized sporadic task model than under the periodic task model, for this representative task set.
Further details about this contribution can be found in our ECRTS 2008 paper.
Within this architecture, admission control (AC), idle resetting (IR), and load balancing (LB) can each be performed per-task or per-job, and idle resetting and load balancing can also be disabled (none). Idle resetting per-job does not make sense if admission control is performed per-task, so combinations involving those contradictory settings are disallowed by our configuration extensions. The figure below shows the different valid combinations of these strategy alternatives that can be configured in our approach.
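Those validity rules can be sketched as follows (the enumeration and names are ours, not the actual configuration interface); note that they admit exactly the 15 reasonable combinations evaluated in the experiments described later:

```cpp
// Illustrative sketch of the strategy-combination validity rule.
enum class Granularity { None, PerTask, PerJob };

struct StrategyConfig {
    Granularity admission_control;  // per-task or per-job (never disabled)
    Granularity idle_resetting;     // none, per-task, or per-job
    Granularity load_balancing;     // none, per-task, or per-job
};

bool is_valid(const StrategyConfig& cfg) {
    // Admission control must always be performed at some granularity.
    if (cfg.admission_control == Granularity::None) return false;
    // Per-job idle resetting contradicts per-task admission control.
    if (cfg.admission_control == Granularity::PerTask &&
        cfg.idle_resetting == Granularity::PerJob) return false;
    return true;
}
```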
As the figure below illustrates, we have implemented this architecture atop the CIAO real-time component middleware by defining separate components for the LB, AC, and IR services and interconnecting them (through events pushed to component ports via a federated event channel) in a processing chain for each task: the chain begins with a task effector component, which initiates processing by the first subtask, and continues through the last subtask.
To allow different applications to be configured readily with different strategy options (by answering a few questions posed by a simple textual interface), we have developed a configuration engine that acts as a front-end to CIAO's deployment and configuration engine (DAnCE), as the figures below illustrate.
To validate our approach, and to evaluate the performance, overheads, and benefits resulting from it, we conducted several experiments on a testbed consisting of six single-core (Pentium) machines (one for the task manager and the others for the application processors), all connected by a 100 megabit Ethernet switch.
In the first experiment, we generated random task sets, each with 4 aperiodic and 5 periodic tasks. We evaluated all 15 of the reasonable strategy combinations and measured the accepted utilization ratio -- the ratio of the utilization by jobs that were accepted and released to run, to the utilization of all jobs that were requested (some of which were not admitted) -- for which the ideal value is 1.0. The results of this experiment, shown in the figure to the left below, demonstrate the importance of performing both admission control and idle resetting per-job, with an additional though less dramatic improvement for performing load balancing per-job, for these task sets.
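Restating that metric symbolically (the notation u_j is ours), with u_j denoting the processor utilization contributed by job j:

```latex
\[
  \mathit{AUR} \;=\;
  \frac{\displaystyle\sum_{j \,\in\, \text{accepted jobs}} u_j}
       {\displaystyle\sum_{j \,\in\, \text{requested jobs}} u_j},
  \qquad 0 \;\le\; \mathit{AUR} \;\le\; 1 .
\]
```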
To evaluate conditions in which application loads are distributed disproportionately among servers, we conducted a second experiment in which all tasks were assigned to 3 of the 5 application processors and duplicates were assigned to the other 2 processors. This configuration ensures that at least initially load is concentrated more heavily on the 3 application processors to which the tasks are assigned (though the duplicates allow load balancing onto the other 2 application processors if that strategy is configured). The results of this experiment, shown in the figure to the right above, demonstrate the increased importance of load balancing (per task or per job) when loads are imbalanced, in addition to confirming the importance of per job admission control and idle resetting in these cases as well.
We also measured the overheads of the service components we developed, using the same platform configuration. The figure below illustrates the different paths through the service components, and our measurements showed that even the greatest accumulated delay incurred along any of these paths was less than 2 milliseconds, which is reasonable for many distributed real-time applications.
Further details about this contribution can be found in our ICDCS 2008 and TPDS 2010 papers.
We also evaluated the offsets between the clock values for the two cores by collecting an intermediate time stamp (y) in between the time stamps immediately before (x) and after (z) the round-trip interval, as the figure on the left below illustrates. The figure in the middle below shows the relative offsets calculated from the perspective of each core, and the figure on the right below shows the absolute offset obtained by reversing the signs of the values collected on the initial core (core 0). The results of these experiments show that the clock frequencies of the cores are similar and that a non-negligible offset (e.g., around 27 microseconds) between the cores may exist, which we continued to measure and account for in our further experiments.
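A minimal sketch of that x/y/z handshake on Linux/x86 (our reconstruction, not the original benchmark code; it assumes the time stamp counter is the clock being compared) follows:

```cpp
// Illustrative sketch: estimate core 1's clock offset from core 0 by taking
// an intermediate stamp (y) on core 1 between two stamps (x, z) on core 0.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <x86intrin.h>
#include <atomic>
#include <cstdint>
#include <cstdio>

static std::atomic<int> phase{0};
static uint64_t x, y, z;

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void* peer(void*) {
    pin_to_core(1);
    while (phase.load() != 1) {}   // wait until x has been taken on core 0
    y = __rdtsc();                 // intermediate stamp on core 1
    phase.store(2);
    return nullptr;
}

int main() {
    pthread_t t;
    pthread_create(&t, nullptr, peer, nullptr);
    pin_to_core(0);
    x = __rdtsc();                 // stamp before the round trip (core 0)
    phase.store(1);
    while (phase.load() != 2) {}
    z = __rdtsc();                 // stamp after the round trip (core 0)
    pthread_join(t, nullptr);
    // With perfectly aligned counters and symmetric latency, y would fall
    // midway between x and z; the residual estimates core 1's offset.
    long long offset = (long long)y - (long long)((x + z) / 2);
    printf("estimated offset: %lld cycles\n", offset);
    return 0;
}
```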
To evaluate the performance implications of alternative thread allocation and migration strategies, we next conducted experiments to quantify the cost of thread migration under three distinct scenarios. In the first scenario, a currently running thread migrates itself. The results of that experiment, shown in the figure on the left below, revealed a migration delay of between 16 and 45 microseconds. In the second scenario, a thread migrates another currently running thread. The results of that experiment, shown in the figure in the middle below, revealed a migration delay of between 18 and 36 microseconds. In the third scenario, a thread migrates another thread that is asleep. The results of that experiment, shown in the figure on the right below, revealed a migration delay of between 4 and 10 microseconds. These results demonstrate that thread migration delays can be large compared to other mechanism costs, and that if migration is used there is a clear (and sometimes substantial) advantage to triggering a thread's migration from another thread rather than a thread migrating itself.
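The migration mechanism itself can be as simple as rewriting a thread's affinity mask. The following Linux-specific sketch (our illustration, not MC-ORB code) distinguishes the self-migration and externally triggered cases measured above:

```cpp
// Illustrative sketch: thread migration via CPU affinity masks.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

static void set_affinity(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(thread, sizeof(set), &set);
}

// Scenario 1: a running thread migrates itself (16 to 45 usec above).
void migrate_self(int target_core) {
    set_affinity(pthread_self(), target_core);
}

// Scenarios 2 and 3: another thread (e.g., a manager) migrates 'victim',
// whether it is running (18 to 36 usec) or asleep (4 to 10 usec).
void migrate_other(pthread_t victim, int target_core) {
    set_affinity(victim, target_core);
}
```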
We took the results of our experiments into account in designing the architecture of MC-ORB, which is illustrated by the figure to the left below: (1) a multi-core-aware task allocation service is used to reduce the number of thread migrations among cores; (2) that service is invoked exclusively by a single highest-priority manager thread to avoid priority inversions; and (3) task invocation requests are handed off to dedicated thread pools on each core to further reduce thread migrations. In contrast, the leader-followers architecture, which is used by previous-generation object request brokers such as TAO and nORB and is illustrated in the figure on the right below, may incur significantly greater thread migration overhead on multi-core platforms. This is because the leader-followers architecture was designed to reduce thread context switches on uniprocessor platforms; when tasks are distributed among multiple cores, the leader thread that receives an invocation request may find itself on a different core than the one on which the task to be invoked resides. MC-ORB addresses this problem through the series of inter-thread hand-offs described above, which, while undesirable in a uniprocessor setting (due to the additional context switches they incur), help to improve overall performance in a multi-core setting.
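The hand-off pattern at the heart of this architecture can be sketched as follows (the queue structure and names are our own illustration, not MC-ORB's actual implementation):

```cpp
// Illustrative sketch: a single manager thread allocates each request to a
// core and hands it to that core's dedicated worker pool, so requests are
// serviced without migrating worker threads across cores.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

struct Request { int task_id; };

class CoreQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<Request> q;
public:
    void push(Request r) {
        { std::lock_guard<std::mutex> lk(m); q.push_back(r); }
        cv.notify_one();  // wake a worker already pinned to this core
    }
    Request pop() {      // called by workers in the core's thread pool
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        Request r = q.front();
        q.pop_front();
        return r;
    }
};

// Run only by the single highest-priority manager thread, which avoids
// priority inversions in the allocation step itself.
void dispatch(std::vector<CoreQueue>& per_core, const Request& r,
              int (*allocate_core)(const Request&)) {
    per_core[allocate_core(r)].push(r);
}
```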
Further details about this contribution can be found in our RTCSA 2009 paper, and also on the MC-ORB web page, where links for open-source download and installation instructions can be found.