The Scalable Systems Group is a research group in the Department of Computer Science & Engineering at Washington University in St. Louis. Our research specializes in system support for large scale computing platforms, usually in the context of supercomputers or public clouds. A particular focus of our group is enabling consistent and predictable system behavior in these environments to better support applications in the high performance computing (HPC), machine learning, and real-time communities.
Office: Jolley Hall 212
Campus Box 1045
One Brookings Drive
St. Louis, MO 63130
I am an assistant professor in the CSE department at WashU, and I lead the group.
My research focuses on the efficient execution of complex parallel workloads on large scale machines. Much of this work addresses performance variability: phenomena that produce inconsistent performance characteristics during workload execution. In this space, my research has demonstrated the benefits of lightweight operating systems and lightweight virtual machines for reducing system variability.
A further goal of my research is to better support tightly synchronized applications on general purpose computing infrastructure, such as public clouds, and to improve the predictability, reliability, and performance of heavily consolidated clouds.
The Hobbes Operating System
Hobbes began as a multi-institutional research project, led by the US Department of Energy, to deliver an operating system for future extreme-scale parallel computing platforms. The project set out to address the major technical challenges of extreme scale: improving energy efficiency, managing massive degrees of parallelism and deep memory hierarchies, and providing resilience in the face of increasing failure rates. More information is available on the Hobbes project website.
In collaboration with the Prognostic Lab at the University of Pittsburgh, our groups contributed the majority of the system software components adopted in the Hobbes OS. Our core approach was to design a multi-kernel environment capable of deploying specialized, lightweight "co-kernel" operating systems that provide performance isolation for massively parallel applications. Such isolation is critical for applications built on synchronous parallel algorithms, common in HPC, machine learning, and large scale graph analytics, where even minor perturbations caused by the operating system can dramatically limit scalability. Detailed descriptions of the major components of the Hobbes OS can be found on the Prognostic Lab website.
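The scalability cost of OS perturbations on synchronous applications can be illustrated with a simple model (a hypothetical sketch for intuition, not part of Hobbes): in a bulk-synchronous program, each step finishes only when the slowest rank does, so even rare per-rank delays inflate nearly every step as the rank count grows.

```python
import random

def bsp_slowdown(num_ranks, num_steps=1000, work=1.0,
                 noise_prob=0.01, noise_len=0.5, seed=0):
    """Estimate the slowdown of a bulk-synchronous program under OS noise.

    Each step completes only when the slowest rank finishes, so a single
    delayed rank (probability `noise_prob` per rank per step, extra time
    `noise_len`) delays all ranks for that step.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_steps):
        step = work
        for _ in range(num_ranks):
            if rng.random() < noise_prob:
                step = work + noise_len
                break  # one delayed rank already sets the step time
        total += step
    return total / (num_steps * work)  # 1.0 means noise-free runtime

# A 1% chance of a 50% delay on any one rank barely affects a single
# rank, but at thousands of ranks almost every step is perturbed.
for n in (1, 16, 256, 4096):
    print(n, round(bsp_slowdown(n), 3))
```

Under these assumed parameters the expected slowdown is 1 + 0.5 * (1 - 0.99^n), which approaches 1.5x as n grows: the motivation for keeping the co-kernel's noise footprint near zero.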
We use Hobbes to study the benefits of lightweight kernels, lightweight virtualization, and performance isolation for applications in supercomputing and cloud computing environments. The easiest way to download and run Hobbes is via the "Hobbes-venv" repository on GitHub.
Performance variability refers to phenomena that create performance imbalance across the tasks of a parallel workload, wasting power, reducing energy efficiency, and prolonging runtimes. Although variability is a well-known problem in the HPC community, it remains a major issue: on some of today's large scale machines, up to 75% of the aggregate processing time across all processors is wasted due to imbalance.
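The wasted-time figure above corresponds to a simple imbalance metric: in a synchronous phase every task waits for the slowest one, so the fraction of aggregate processor time lost is 1 - mean/max of the per-task times. A minimal sketch of that calculation (an illustration, not varbench itself):

```python
def wasted_fraction(task_times):
    """Fraction of aggregate processor time spent idle when every task
    must wait for the slowest one (equivalently, 1 - mean/max)."""
    slowest = max(task_times)
    busy = sum(task_times)
    # Each of the len(task_times) processors is occupied for `slowest`
    # seconds; only `busy` seconds of that is useful work.
    return 1.0 - busy / (len(task_times) * slowest)

print(wasted_fraction([1.0, 1.0, 4.0]))  # prints 0.5
```

With per-task times of 1, 1, and 4 seconds, the phase takes 4 seconds on all three processors but only 6 of the 12 processor-seconds do useful work, so half the aggregate time is wasted.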
Our group designed a new performance evaluation framework called "varbench" ("variability benchmark") to help identify sources of variability in emerging systems. Beyond measuring variability, varbench also aims to characterize variability along several key dimensions, allowing researchers to reason about the effectiveness of load balancing techniques, compare performance distributions across different architectures, and determine the impact of tunable system parameters on the types of variability produced by their machine.
We use varbench to study variability in parallel architectures, operating system kernels, hypervisors, networks, and I/O stacks. Varbench is available on GitHub.
ICPP '19: D. Zahka, B. Kocoloski, and K. Keahey, Reducing Kernel Surface Areas for Isolation and Scalability, in Proceedings of the 48th International Conference on Parallel Processing, August 2019.
ICPP '18: B. Kocoloski and J. Lange, Varbench: an Experimental Framework to Measure and Characterize Performance Variability, in Proceedings of the 47th International Conference on Parallel Processing, August 2018.
ISC HPC '18: H. Weisbach, B. Gerofi, B. Kocoloski, H. Härtig, and Y. Ishikawa, Hardware Performance Variation: a Comparative Study using Lightweight Kernels, in Proceedings of the International Conference on High Performance Computing, June 2018.
SC '16: N. Evans, B. Kocoloski, J. Lange, K. Pedretti, S. Mukherjee, R. Brightwell, M. Lang, and P. Bridges, Hobbes Node Virtualization Layer: System Software Infrastructure for Application Composition and Performance Isolation (Poster), in Proceedings of the 28th Annual IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis, November 2016.
CLUSTER '16: B. Kocoloski, L. Piga, W. Huang, I. Paul, and J. Lange, A Case for Criticality Models in Exascale Systems, in Proceedings of the 18th IEEE International Conference on Cluster Computing, September 2016.
MEMSYS '15: B. Kocoloski, Y. Zhou, B. Childers, and J. Lange, Implications of Memory Interference for Composed HPC Applications, in Proceedings of the 1st International Symposium on Memory Systems, October 2015.
HPDC '15: B. Kocoloski and J. Lange, XEMEM: Efficient Shared Memory for Composed Applications on Multi OS/R Exascale Systems, in Proceedings of the 24th International ACM Symposium on High Performance Parallel and Distributed Computing, June 2015.
HPDC '15: J. Ouyang, B. Kocoloski, J. Lange, and K. Pedretti, Achieving Performance Isolation with Lightweight Co-kernels, in Proceedings of the 24th International ACM Symposium on High Performance Parallel and Distributed Computing, June 2015.
IPDPS '14: B. Kocoloski and J. Lange, HPMMAP: Lightweight Memory Management for Commodity Operating Systems, in Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, May 2014.
SOCC '12: B. Kocoloski, J. Ouyang, and J. Lange, A Case for Dual Stack Virtualization: Consolidating HPC and Commodity Applications in the Cloud, in Proceedings of the 3rd ACM Symposium on Cloud Computing, October 2012.
ExaMPI '16: N. Evans, K. Pedretti, S. Mukherjee, R. Brightwell, B. Kocoloski, J. Lange, and P. Bridges, Remora: A MPI runtime for Composed Applications at Extreme Scale, in Proceedings of the Workshop on Exascale MPI, November 2016.
ROSS '16: N. Evans, K. Pedretti, B. Kocoloski, J. Lange, M. Lang, and P. Bridges, A Cross-Enclave Composition Mechanism for Exascale System Software, in Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers, June 2016.
ROSS '15: B. Kocoloski, J. Lange, H. Abbasi, D. Bernholdt, T. Jones, J. Dayal, N. Evans, M. Lang, J. Lofstead, K. Pedretti, and P. Bridges, System-Level Support for Composition of Applications, in Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, June 2015.
ROSS '12: B. Kocoloski and J. Lange, Better than Native: Using Virtualization to Improve Compute Node Performance, in Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, June 2012.
TPDS: B. Kocoloski and J. Lange, Lightweight Memory Management for High Performance Applications in Consolidated Environments, IEEE Transactions on Parallel and Distributed Systems, Volume 27, Issue 2, pages 468-480, February 2016.
IJHPCA: B. Kocoloski and J. Lange, Improving Compute Node Performance Using Virtualization, International Journal of High Performance Computing Applications, Volume 27, Issue 2, pages 124-135, May 2013.