Withdraw
Loading…
Verification and testing of cloud infrastructure systems
Sun, Xudong
Loading…
Permalink
https://hdl.handle.net/2142/129839
Description
- Title
- Verification and testing of cloud infrastructure systems
- Author(s)
- Sun, Xudong
- Issue Date
- 2025-07-08
- Director of Research (if dissertation) or Advisor (if thesis)
- Xu, Tianyin
- Doctoral Committee Chair(s)
- Xu, Tianyin
- Committee Member(s)
- Ganesan, Aishwarya
- Godfrey, Philip Brighten
- Howell, Jon
- Marinov, Darko
- Suresh, Lalith
- Department of Study
- Siebel School Comp & Data Sci
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Computer systems
- cloud systems
- distributed systems
- system reliability
- formal verification
- formal specification
- theorem proving
- software testing
- Abstract
- Cloud infrastructure systems like Kubernetes, Borg, and Twine are the foundation of the modern cloud computing world. These systems are architected as a fleet of controllers. Controllers manage large-scale cluster resources and all applications running atop them, making their reliability paramount. Bugs in controllers can affect all the upper layer applications and lead to severe consequences, such as service outages, data loss, and security issues. Ensuring the reliability of controllers is notoriously hard as controllers perform sophisticated management tasks while running within complex and dynamic environments. This dissertation focuses on improving the reliability of cloud infrastructure systems using formal verification and testing techniques: Formal verification enables a path toward fully verified cloud infrastructure systems by incrementally replacing existing controllers with verified ones, and continuous and extensive testing improves controller reliability before there are verified replacements. We first present state-centric reasoning, a general approach to reasoning about the behaviors of controllers. The key idea is to reason about the cluster state shared by controllers, instead of each controller’s internal state. State-centric reasoning represents a controller’s behavior as the cluster state’s evolution—a uniform representation for diverse controllers. The uniform representation enables formal verification and efficient testing for controllers. As the first step to formally verify controllers, we present eventually stable reconciliation (ESR), a general formal specification for controller correctness. The key idea of ESR is to capture state reconciliation, the essential functionality of controllers, using a liveness property that describes how the cluster state should evolve. We formalize ESR as a concise formula in TLA. ESR is powerful enough to preclude a broad range of controller bugs and is realistic with appropriate assumptions on the environment. To close the gap between formal specifications (e.g., ESR) and controller implementation code, we present Anvil, the first framework that allows developers to build formally verified, practical controller implementations. Anvil emphasizes verifying both liveness and safety properties for implementation code. To achieve this goal, Anvil combines Hoare-style reasoning for imperative code and TLA-style reasoning for state machines. To reduce the manual proof burden, Anvil provides verification support, including reusable models and lemmas. With Anvil, we have built the first verified Kubernetes controllers for managing critical distributed applications. The verified controllers achieve feature parity and competitive performance compared to their unverified, mature references. Formal verification offers strong correctness guarantees, but we still need to test existing controllers continuously and extensively before there are verified replacements. Testing can also catch bugs that originate from the interaction between verified and unverified systems. However, previous work does not offer a generally applicable, comprehensive, and efficient testing approach for diverse controllers. To address this problem, we present Sieve, the first automatic reliability testing tool for controllers. Sieve’s key idea is state perturbation: Sieve perturbs the controller’s view of the cluster state in ways it is expected to tolerate, and then compares the cluster state’s evolution with and without perturbations to detect triggered bugs. We evaluated Sieve on ten popular open-source controllers of various kinds. Sieve found 46 new bugs in total, among which 35 have been confirmed (22 fixed) after we reported them. This dissertation marks a first step toward building fully verified cloud infrastructure systems. We conclude by outlining future directions to advance this vision.
- Graduation Semester
- 2025-08
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129839
- Copyright and License Information
- Copyright 2025 Xudong Sun
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…