We work on parallel systems problems at the intersection of high performance computing and cloud computing
We emphasize programmable isolation and predictability as key enablers of scalable system performance
We specialize in core systems software, including operating systems, hypervisors, and parallel runtimes
About the Group

The Scalable Systems Group is a research group in the Department of Computer Science & Engineering at Washington University in St. Louis. Our research specializes in system support for large-scale computing platforms, usually in the context of supercomputers or public cloud computers. A particular focus of our group is enabling consistent and predictable system behavior in these environments to better support applications in the high performance computing (HPC), machine learning, and real-time communities.

Brian Kocoloski

I am an assistant professor in the CSE department at WashU and I lead the group.

My research focuses on the efficient execution of complex parallel workloads on large-scale machines. Much of my work has focused on performance variability: phenomena that cause inconsistent performance during workload execution. In this space, my research has demonstrated the benefits of lightweight operating systems and lightweight virtual machines for reducing system variability.

A further goal of my research is to better support tightly-synchronized applications on more general purpose computing infrastructures, such as public cloud computers, as well as to improve the predictability, reliability and performance of heavily consolidated clouds.

The Hobbes Operating System

Hobbes began as a multi-institutional research project, led by the US Department of Energy, to deliver an operating system for future extreme-scale parallel computing platforms. The goals of the Hobbes project were to address major technical challenges of energy efficiency, manage massive degrees of parallelism and deep memory hierarchies, and provide resilience in the presence of increasing failures. For more information on the Hobbes project, click here.

In collaboration with the Prognostic Lab at the University of Pittsburgh, our groups contributed the majority of the system software components adopted in the Hobbes OS. Our core approach was to design a multi-kernel environment capable of deploying specialized, lightweight "co-kernel" operating systems that enable performance isolation for massively parallel applications. This is a critical component for applications that leverage synchronous parallel algorithms, common in HPC, machine learning, and large scale graph analytics, where even minor perturbations caused by the operating system can dramatically limit scalability. Detailed descriptions of the major components of the Hobbes OS can be found at the Prognostic Lab.

We use Hobbes to study the benefits of lightweight kernels, lightweight virtualization, and performance isolation for applications in supercomputing and cloud computing environments. The easiest way to download and run Hobbes is via the "Hobbes-venv" repository:


Performance variability refers to phenomena that cause performance imbalance across the tasks of a parallel workload, resulting in wasted power, losses in energy efficiency, and prolonged runtimes. Although variability is a well-known problem in the HPC community, it remains a major issue, with up to 75% of the aggregate processing time across all processors wasted due to imbalance on some of today's large-scale machines.
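As a rough illustration of how imbalance turns into wasted processor time, the Python sketch below assumes a single bulk-synchronous step in which every task waits at a barrier for the slowest one. The function name `wasted_fraction` and the sample timings are hypothetical; they are not taken from varbench or from any measured system.

```python
def wasted_fraction(task_times):
    """Return the fraction of aggregate processor time spent idle.

    Assumes one synchronous step: each task runs for task_times[i]
    seconds, then waits until the slowest task has finished.
    """
    if not task_times:
        raise ValueError("task_times must not be empty")
    slowest = max(task_times)
    if slowest <= 0:
        raise ValueError("task times must be positive")
    # Every processor is occupied until the slowest task completes.
    aggregate = slowest * len(task_times)
    # Time actually spent doing useful work across all tasks.
    busy = sum(task_times)
    return (aggregate - busy) / aggregate


if __name__ == "__main__":
    # Hypothetical timings: three tasks finish in 1 s, one straggler takes 4 s.
    times = [1.0, 1.0, 1.0, 4.0]
    print(f"wasted fraction: {wasted_fraction(times):.4f}")
```

Under this simplified model a single straggler dominates the result: the step lasts as long as the slowest task, so the other processors sit idle for most of it. A perfectly balanced workload yields a wasted fraction of zero.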

Our group designed a new performance evaluation framework called "varbench" ("variability benchmark") to help identify sources of variability in emerging systems. Beyond measuring variability, varbench also aims to characterize variability along several key dimensions, allowing researchers to reason about the effectiveness of load balancing techniques, compare performance distributions across different architectures, and determine the impact of tunable system parameters on the types of variability produced by their machine.

We use varbench to study variability in parallel architectures, operating system kernels, hypervisors, networks, and I/O stacks. Varbench is available on GitHub:

