This research investigates how to maintain survivable operation of such systems, even in the face of invasive attacks where computations are intentionally subverted to interfere with other computations' execution constraints. The goal of this research is to develop new techniques for isolating the effects of interactions among computations through specific resources in these systems, including: flexible specification and rigorous enforcement of computations' execution constraints; explicit control of all OS kernel components under a single scheduler; detailed on-line monitoring of computations and their supporting OS kernel components; automated learning to discover previously unknown interactions among computations; and formal modeling and verification of computations, execution constraints, and system components and resources.
The expected benefits of this project include: a novel approach to non-bypassable isolation of computations from the effects of adversarial attack in which isolation can be enforced flexibly according to the system-specific execution constraints that must be satisfied; a high quality open-source software implementation of kernel-level scheduling and monitoring services that provide and measure such non-bypassable isolation; new formal models, analyses, and methodologies for verifiably correct configuration and management of those services; and empirical studies of the services' ability to protect computations from interference under a wide range of adversarial attacks.
William D. Smart, Associate Professor of Computer Science and Engineering, is co-PI for this research at Washington University in St. Louis.
Douglas Niehaus, Associate Professor, Electrical Engineering and Computer Science Department, is PI for collaborative research at the University of Kansas.
The goal of this research has been to develop new techniques for non-bypassable specification and enforcement of execution guarantees for system tasks, leveraging and extending those prior results. Our discoveries to date include seven major contributions towards this goal. First, we have established a novel approach to scheduling policy design and verification that is specifically designed for allocating resources to tasks with variable execution times. Second, we have developed a novel technique for using repeated (periodic) structure to wrap the state space into a tightly bounded form. Third, we have shown how a state space suitable for verification with existing timed automaton model checkers can be generated by first enumerating verification states that consist of sets of utilization states, and then defining suitable difference bound matrices (DBMs). Fourth, we have developed a new technique called Expanding State Policy Iteration (ESPI) that allows computation of an optimal policy for a wrapped state model. Fifth, we have shown that for highly structured examples like proportional sharing of a resource, the structure of the generated scheduling policies themselves can be approximated with high accuracy, leading to a scalable technique for generating high quality scheduling policies for larger numbers of concurrent activities. Sixth, we have shown that even when the distributions of task execution times are unknown in advance, if the distributions are stable then reinforcement learning can be used to obtain suitable policies on-line. Seventh, we have extended our approach to include periodic tasks whose individual utility can be expressed as functions of their completion times.
Using the distributions of thread execution times, we define a Markov Decision Process (MDP) whose actions correspond to the decision to dispatch a particular available thread (in the system model described above we assume all threads are always available), and whose states correspond to the number of time quanta during which the resource was utilized by each thread (represented as integer-valued vectors). The MDP's transition function is defined in terms of the run-time distribution for each thread.
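The MDP construction described above can be sketched as follows; this is a minimal illustration with hypothetical function and variable names, not the project's actual implementation. States are integer vectors of per-thread utilization quanta, each action dispatches one thread, and the transition probabilities come from that thread's run-time distribution.

```python
# Sketch of the utilization-state MDP: states are tuples of per-thread
# quanta consumed; dispatching a thread advances only its own component,
# weighted by the thread's run-time distribution.

def successors(state, action, runtime_dists):
    """Enumerate (next_state, probability) pairs for dispatching `action`.

    `state` is a tuple of per-thread utilization counts; `runtime_dists[action]`
    maps a run time (in quanta) to its probability.
    """
    for duration, prob in runtime_dists[action].items():
        nxt = list(state)
        nxt[action] += duration      # only the dispatched thread consumes quanta
        yield tuple(nxt), prob

# Example: two threads; thread 0 runs 1 or 2 quanta, thread 1 always runs 1.
dists = [{1: 0.5, 2: 0.5}, {1: 1.0}]
print(list(successors((0, 0), 0, dists)))   # [((1, 0), 0.5), ((2, 0), 0.5)]
```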
To bound the number of states, our initial approach introduced a termination time (our subsequent development of the wrapped state model, described in the next section, allows us to do away with this restriction), resulting in absorbing states from which further actions do not change the state of the system, as the figure below illustrates.
We are then able to generate a verification state space, whose states consist of the subsets of utilization states reachable on each action, based on the threads' run-time distributions, as the following figure illustrates. In the figure below, utilization states are denoted by circles, which are grouped into (often overlapping) verification states denoted by the boxes around each such group.
To illustrate the use of these techniques, we generated an optimal policy for the two thread run-time distributions shown above, illustrated in the figure below. Important features of the figure include: (1) the specified target utilization boundary that the policy is intended to enforce, illustrated by the black ray in each panel; (2) the slight parallel overlap of the edge of the dark grey region (in which thread 1 is dispatched) with the target utilization ray, shown in the left and bottom right panels; and (3) the further deviation from and then convergence back to the ray by the edge of the dark grey region near the termination time, shown in the top right panel.
The target utilization is a specified input to the generation of the optimal policy. The overlap of the dark grey region with the target utilization ray is due to the threads' relative run-time distributions: near the ray, and even slightly below it, running thread 1 has a higher probability of keeping the utilization nearer the target ray than does running thread 2. The deviation from and then convergence back to the ray near the termination time is an artifact of the termination time itself (which in turn is a motivation for our development of the wrapped state model described next). This illustration of the approach highlights its ability to encode scheduling policies for time-varying thread executions, and to enforce them with reasonably good precision. Further discussion of the approach, and details of the models used, is available in our ATC 2008 paper.
The intuition on which our second contribution is based is that the marked states in the figure above are highly similar: they have the same reward, and the relative distribution of future states from them is the same, so we would expect the optimal policy to prefer the same action in these states. The key idea of our approach is to detect such similar states and collapse them together to obtain a smaller, more tractable MDP model. We say that a system is periodic if and only if the utilization ray passes through utilization states at regular intervals.
This periodic property allows us to combine equivalence classes of the original utilization state space shown above into individual states of a wrapped state model, shown in the figure below. Each state in the wrapped state model corresponds to one such equivalence class, and the transitions are again shown in red for thread 1 and in black for thread 2.
One further refinement is needed, since in the figure above the state space is still unbounded (though significantly reduced). We observe that the wrapped state model shown in the figure above consists of possible displacements from the utilization ray among integer-valued states. In states that are sufficiently far from the target utilization, we expect the greedy policy of choosing the action expected to result in a state nearest the target utilization to be a suitable proxy for the optimal policy. With the assumption that each thread cannot exceed some maximum run-time on a single execution, our strategy for bounding the state space is then to truncate the state space to include only the states in which the choice of which thread to schedule is not already determined with high probability. The resulting bounded wrapped state model is shown in the figure below, in which the absorbing states reflect the equivalence of further actions once the decision of which thread to schedule is already well determined.
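The wrapping and truncation steps above can be sketched in a few lines for two threads; the function name, the signed-displacement encoding, and the clamping scheme here are illustrative assumptions, not the project's actual code. States with equal displacement from the target ray collapse into one wrapped state, and displacements beyond a bound are clamped into absorbing states.

```python
# Hypothetical sketch of state-space wrapping for two threads with target
# shares (s1, s2): utilization states (x1, x2) with the same displacement
# from the ray (s2*x1 - s1*x2 == const) collapse into one wrapped state,
# and the displacement is truncated to a bounded window.

def wrapped_state(state, shares, bound):
    """Map a utilization state to its (truncated) signed displacement."""
    x1, x2 = state
    s1, s2 = shares
    d = s2 * x1 - s1 * x2              # zero exactly on the target ray
    return max(-bound, min(bound, d))  # absorbing states beyond +/- bound

# States (1, 1), (2, 2), (3, 3) are equivalent under equal shares:
print([wrapped_state(s, (1, 1), 10) for s in [(1, 1), (2, 2), (3, 3)]])  # [0, 0, 0]
```

A state far off the ray, such as (20, 0), clamps to the absorbing displacement bound, reflecting that the scheduling decision there is already well determined.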
The impact of wrapping the state model on the cost of verifying properties based on it is also significant. In the resulting verification state space, the initial exploration of which is illustrated in the figure shown below, each verification state consists of the utilization states reachable from a previous verification state under a particular action. In our previous approach, as the termination time (and with it the finite history of system states over which verification states were computed) was increased, the resulting verification state space grew exponentially, with a history size of 20 resulting in 2,883,441 verification states and 3,604,283 transitions, which took 10 hours 47 minutes to enumerate using the IF toolset. Our wrapping approach greatly reduces the size of the verification state space, so that for the same example an exploration using the IF toolset took less than a second, as the verification state space had only 201 states and 398 transitions.
We define DBMs to represent salient features of both the wrapped state space and the scheduling policy itself. The figure above shows the general form of DBMs needed to represent the utilization regions of the wrapped state space, while the figure below shows decision regions of the policy itself, both encoded as DBMs.
We also define operators run and wrap over the DBMs, which transform a DBM according to the effects of (1) running a particular thread, or (2) wrapping the state space, respectively.
The figure above shows the effect of applying the run operator to DBM D, transforming it into DBM D', which encodes the effects of running thread 2. The left side of the figure below shows the possible DBMs that may be obtained by then applying the wrap operator to D', using different values of a specified compression parameter. The right side of the figure below shows the results of intersecting those possible DBMs with the utilization regions encoded by the wrapped state space DBMs.
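The effect of the run operator on a DBM can be sketched with a toy encoding; this is our own simplified illustration, not the paper's exact DBM representation or operator definitions. Here a DBM entry M[i][j] is an upper bound on x_i - x_j (with x0 a reference variable fixed at 0), and running thread k for between lo and hi quanta shifts every bound involving x_k.

```python
import copy

# Toy DBM over variables x0 (reference, fixed at 0), x1, x2:
# M[i][j] is an upper bound on x_i - x_j.

def run(M, k, lo, hi):
    """Return the DBM after x_k increases by some amount in [lo, hi]."""
    n = len(M)
    D = copy.deepcopy(M)
    for j in range(n):
        if j != k:
            D[k][j] += hi   # x_k - x_j can grow by at most hi
            D[j][k] -= lo   # x_j - x_k shrinks by at least lo
    return D

# Zone x1 = x2 = 0 (all pairwise differences bounded by 0).
M0 = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
M1 = run(M0, 2, 1, 3)       # run thread 2 for 1..3 quanta
print(M1[2][0], M1[0][2])   # 3 -1  -> the zone now satisfies 1 <= x2 <= 3
```

The wrap operator would analogously shift a zone back by a multiple of the period; intersection of the resulting DBMs with the utilization-region DBMs is then entrywise minimization.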
The state space wrapping technique and the use of DBMs for verification allow us to: (1) remove the artificial termination time, and its resulting artifacts; (2) produce a much more compact representation of the utilization and verification state spaces which significantly reduces the cost of defining and verifying properties of scheduling policies; and (3) verify properties over state variables with continuous values as well as variables with discrete ones. Further discussion of these refinements to our approach, and details of the models and techniques used, is available in our RTSS 2008 paper.
The figure below shows an approximation to the optimal scheduling policy for a problem instance with two tasks, where each point denotes a state in the wrapped state space. The target utilization ray (corresponding to utilization shares of 7/12 and 5/12) is shown as a dashed ray emanating from the bottom left corner. Each task has a different duration distribution supported on the interval (0, 8]. Task 1 advances along the horizontal axis, and task 2 advances along the vertical axis.
The scheduling policy selects task 1 in states denoted by closed circles, and selects task 2 in states denoted by closed squares. Notice that the decision boundary (the surface separating regions of the state space where the policy is homogeneous) can be described using a ray parallel to the utilization ray, with an offset from it labeled d in the figure above.
The figure below illustrates an approximation to the optimal scheduling policy for a problem instance with three tasks, again with each distinctly shaped dot in the figure illustrating a decision to run a particular task in that state. The policy is further illustrated by plotting it at three different time horizons, with the leftmost figure showing the state space after 10 time steps, the middle figure after 20 time steps, and the rightmost figure after 30 time steps.
This leads to a compact parameterized model for n tasks consisting of n action vectors and a decision offset (labeled d as in the earlier example with two tasks). The leftmost figure below shows the parameterization of a two task example. The middle figure below illustrates the generalization of the model to three tasks, and the rightmost figure below shows how the action vectors and decision offset combine to define the decision boundaries between homogeneous regions.
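For the two task case, the parameterized decision rule can be sketched as follows; the function name and the exact form of the rule are our own illustrative assumptions, reflecting the observation that the decision boundary is a ray parallel to the utilization ray at offset d.

```python
# Hedged sketch of the two-task parameterized policy: dispatch task 1
# whenever the state's signed displacement from the target utilization
# ray (shares s1:s2) is below the decision offset d, else task 2.

def parameterized_policy(state, shares, d):
    """Pick task 1 below the offset boundary, task 2 at or above it."""
    x1, x2 = state
    s1, s2 = shares
    displacement = s2 * x1 - s1 * x2   # zero exactly on the target ray
    return 1 if displacement < d else 2

# With shares 7:5 (i.e., 7/12 and 5/12) and decision offset d = 2,
# the balanced state (3, 3) lies below the boundary, so task 1 runs:
print(parameterized_policy((3, 3), (7, 5), 2))
```

Searching over d (and, for n tasks, the n action vectors) is then a low-dimensional optimization, which is what makes the hill climbing and policy gradient strategies discussed next tractable.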
To evaluate the quality of the policies generated by this new technique, we performed experiments comparing the results using this parameterized model with two different search strategies (hill climbing and policy gradient), to results obtained by a greedy policy heuristic and a heuristic of always scheduling the most underused task. With four tasks, we also found a case where we were able to compare these policies to a policy generated by our MDP based approach, though the MDP based approach was only tractable for a subset of the four task examples we tried.
The leftmost figure below shows results from the four task case, in which both search strategies converged towards the MDP based bound, with performance comparable to the greedy heuristic and significantly outperforming the underused heuristic. The middle figure shows the results of a ten task case, with the parameterized policy significantly outperforming both the greedy and underused heuristics after about 20 iterations of search (the MDP based approach was intractable for ten tasks). The rightmost figure below shows that with only two or three tasks the MDP, greedy, and parameterized approaches (after 100 iterations of search) perform similarly, though the MDP based approach is slightly better; however, as the number of tasks increases (and the MDP based approach becomes intractable) the parameterized approach increasingly outperforms the greedy and underused heuristics.
Further discussion of these extensions of our approach, and details of the models and techniques used, is available in our RTAS 2010 paper.
We assume that each task's duration obeys some underlying but unknown stationary distribution, and provide probably approximately correct (PAC) bounds on the computational complexity of learning a near-optimal policy using balanced wandering. Our result is novel, as it extends established methods for learning finite Markov decision processes to a domain with a countably infinite state space with unbounded costs.
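Balanced wandering itself is simple to sketch; the helper names and bookkeeping here are illustrative assumptions. In each state the learner takes whichever action it has tried least often there, so the estimated model of every action becomes uniformly accurate.

```python
from collections import defaultdict

# Sketch of balanced wandering for two actions: per-state counts of how
# often each action has been tried; always take the least-tried action.

counts = defaultdict(lambda: [0, 0])   # state -> [tries of action 0, action 1]

def balanced_wandering(state):
    c = counts[state]
    a = 0 if c[0] <= c[1] else 1   # least-tried action, ties to action 0
    c[a] += 1
    return a

# Repeated visits to the same state alternate between the two actions:
print([balanced_wandering((0, 0)) for _ in range(4)])   # [0, 1, 0, 1]
```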
We compared the performance of several exploration strategies in the context of task scheduling by conducting experiments comparing optimistic interval-based, epsilon-greedy, and balanced wandering exploration strategies. To compare the performance of these exploration strategies, we generated 1000 random problem instances with two tasks. Duration distributions for these tasks consisted of discretized normal distributions supported on the interval [1,32], with means and variances selected uniformly at random from the respective intervals [1,32] and [1,8]. Utilization targets for each task were chosen according to u = (u1, u2)/(u1 + u2), where u1 and u2 were integers selected uniformly at random over [1,128]. We conducted experiments by initializing the model in the state x = (0, 0). The controller simulated a single trajectory over 20,000 decision epochs in each problem instance with each exploration strategy. We report the number of mistakes, i.e., the number of times the exploration strategy chooses an action that has greater expected cost than the greedy action. The results of these experiments are shown in the figures below. We report 95% confidence intervals on the mean number of mistakes each exploration strategy makes, averaged across the problem instances described above. Note that these plots have different scales due to the variation in mistake rates among exploration strategies.
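The epsilon-greedy strategy used in these comparisons can be sketched in a few lines; the function name and cost-vector framing are illustrative assumptions. With probability epsilon it takes a random action, otherwise the greedy (least estimated cost) action; epsilon = 0 recovers the purely exploitive "Exploit" policy.

```python
import random

# Minimal illustration of epsilon-greedy action selection over estimated
# per-action costs: explore with probability epsilon, else exploit.

def epsilon_greedy(estimated_costs, epsilon, rng=random):
    """Return an action index: random with prob. epsilon, else least-cost."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_costs))
    return min(range(len(estimated_costs)), key=estimated_costs.__getitem__)

# A "mistake" is any choice with higher expected cost than the greedy action;
# with epsilon = 0 (the "Exploit" policy) no deliberate mistakes are made.
costs = [1.0, 2.5]
picks = [epsilon_greedy(costs, 0.0) for _ in range(5)]
print(picks)   # [0, 0, 0, 0, 0]
```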
The leftmost figure compares the performance of interval-based optimistic action selection to that of an "Exploit" policy that behaves greedily with respect to the estimated model at each decision epoch. All of the interval-based exploration settings we considered exhibited statistically similar performance. Interestingly, the exploitive strategy yields similar performance to the explorative strategies despite its lack of an explicit exploration mechanism. The exploitive policy's benefits are clear in comparison to the two other exploration strategies we consider.
The middle figure illustrates the performance of epsilon-greedy exploration. Notice that the mistake rate decreases along with the likelihood of taking exploratory actions, i.e., as epsilon approaches zero. Explicit exploration may not improve performance in this domain. This is further supported by our results for balanced wandering. The theory behind balanced wandering is that making a few initial mistakes early on will pay off in the long run due to more uniformly accurate models.
The rightmost figure shows that this is not the case in our scheduling domain, as a purely exploitive strategy with m = 0 outperforms each balanced wandering approach. These results suggest that the exploitive strategy may be the best available exploration method in the task scheduling problem. One plausible explanation is that the environment itself enforces rational exploration: if some task is never dispatched, the system will enter progressively more costly states as that task becomes more and more underused. Thus, eventually the estimated benefit of running that task will be substantial enough that the exploitive strategy must use it.
Further discussion of these extensions of our approach, and details of the models and techniques used, is available in our UAI 2010 paper.
Our resulting task model is illustrated in the figure below. The top represents a classic periodic real-time task with a single job released in each period and a relative deadline after which the utility of the job falls to zero. The middle represents a generalization in which multiple jobs may be active in the system at once, with utility extending beyond the period in which they were released. The bottom represents another example, where the utility is constant for a given interval and then decays linearly to 0. As is illustrated in these examples, our task model assumes that each job's utility falls to zero at some termination time that is the same duration after the release times for all jobs of a given task (but that these offsets may differ from task to task).
The figure below illustrates how these utility curves can be integrated within an MDP-based approach that is similar to, but differs in several important details from, our previous work. The first difference is that each task's utility function (shown at the top of the figure below) must be convolved with the distribution of execution times for the task to produce a distribution of expected potential utility density, which can be used to define a reward function.
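The convolution step can be sketched directly; the function and variable names here are illustrative assumptions. The expected utility of dispatching a job at time t is the average of the utility at its completion time t + d over the run-time distribution.

```python
# Sketch of convolving a task's time-dependent utility with its discrete
# execution-time distribution: E[utility(t + D)] over run times d.

def expected_utility(utility, duration_dist, t):
    """Expected utility of dispatching at time t, given run-time dist. D."""
    return sum(p * utility(t + d) for d, p in duration_dist.items())

# Example: utility constant at 1 until a deadline of 4, then zero (the hard
# real-time shape); run time is 1 or 2 quanta with equal probability.
u = lambda c: 1.0 if c <= 4 else 0.0
dist = {1: 0.5, 2: 0.5}
print(expected_utility(u, dist, 3))   # 0.5: only the 1-quantum run meets the deadline
```

Comparing this value at successive dispatch times t is also what lets a policy decide that idling now and running the job later accrues greater utility, as discussed next.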
The second difference is that the state space is based on the available jobs at each discrete time step within the hyper-period (the least common multiple of all the task periods). As in our previous work, the transitions in this state space encode the scheduling of each job as a distinct action, but the third difference is that a special idle action that schedules no job during a particular time step is also always available. This last extension allows scheduling policies to be non-work-conserving, allowing execution of a job to be delayed until greater utility can be accrued from it.
The figures below illustrate a simple deterministic two task system, with the leftmost and middle figures showing the transitions for choosing to schedule each of the tasks, and the rightmost figure showing the transitions for the idle action. Note that in each of these figures the hyper-period is only four time units long, after which the transitions wrap back to the beginning of the hyper-period.
The figure below shows the set of reachable states resulting from the transitions in the figures above.
To evaluate the efficacy of our MDP-based approach, particularly in comparison to heuristic methods, we conducted experiments using three distinct utility functions that are illustrated in the figures below. The leftmost figure represents a classic hard real-time task where utility is constant until its termination time. The middle figure represents a soft real-time task where utility is initially constant and then decays linearly until the termination time. The rightmost figure represents a target sensitive task (e.g., for processing audio-video frames) for which utility is zero initially, increases linearly to a peak, and then decays linearly to its termination time.
For each experiment, we generated task duration distributions using an upper and lower bound, and a threshold dividing the lower 80 percent and upper 20 percent of the distribution, as the figure below illustrates. Together with additional constraints, these bounds enforced a total offered load that remained mostly between 70 percent and 90 percent of system capacity, with occasional transient overloads up to 120 percent.
The figures below show the performance of the heuristics we examined, relative to the value-optimal policies generated by our MDP-based approach. The upper left figure was generated with 2 tasks, the upper right figure with 3 tasks, the lower left figure with 4 tasks, and the lower right figure with 5 tasks.
The figure below summarizes how the performance of the heuristics decays with increasing numbers of tasks, up to 5 tasks beyond which our MDP-based approach became intractable. While these results demonstrate the feasibility of extending our techniques to small numbers of tasks with diverse individual utility characteristics, they also motivate significant future work towards addressing the questions of scalability and different forms of task availability that remain.
Further discussion of these extensions of our approach, and details of the models and techniques used, is available in our RTSS 2010 paper.