Balancing effectiveness and flakiness of non-deterministic machine learning tests

Xia, Chunqiu

This item is available for download only to members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with their NetID and password to view the item. To access an Illinois-restricted dissertation or thesis from outside the university, request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.

Permalink

https://hdl.handle.net/2142/127399

Description

Title

Balancing effectiveness and flakiness of non-deterministic machine learning tests

Author(s)

Xia, Chunqiu

Issue Date

2024-12-09

Director of Research (if dissertation) or Advisor (if thesis)

Zhang, Lingming

Department of Study

Siebel School of Computing and Data Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

Machine Learning
Software Engineering

Abstract

Testing Machine Learning (ML) projects is challenging due to the inherent non-determinism of various ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices of assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false positive failures in tests, they often set bounds that are too loose, which can cause tests to miss critical bugs. We present FASER, the first systematic approach for balancing the trade-off between the fault-detection effectiveness and the flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem over the two competing objectives, with the assertion bound as the variable. FASER leverages 1) statistical methods to estimate the flakiness rate, and 2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 23 of the 87 studied tests have conservative bounds and proposes tighter assertion bounds that maximize the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers, each fixing one test, of which 14 have already been accepted.
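
The trade-off described in the abstract can be illustrated with a small sketch. This is not FASER's implementation: the function names, the sample data, the candidate bounds, and the choice of a fitted normal distribution as the "statistical method" are all assumptions made for illustration. It assumes a test of the form `assert error <= bound`, estimates how often the unmodified code would exceed each candidate bound, counts how many mutants each bound would catch, and keeps the tightest bound that catches the most mutants within a flakiness budget.

```python
from statistics import NormalDist, mean, stdev


def estimate_flaky_rate(passing_samples, bound):
    """Estimate P(error > bound) for the correct code.

    Assumption (not from the thesis): the observed error is roughly
    normally distributed, so the tail of a fitted normal is used.
    """
    dist = NormalDist(mean(passing_samples), stdev(passing_samples))
    return 1.0 - dist.cdf(bound)


def mutation_score(mutant_errors, bound):
    """Fraction of mutants whose error exceeds the bound (mutants killed)."""
    killed = sum(1 for err in mutant_errors if err > bound)
    return killed / len(mutant_errors)


def choose_bound(passing_samples, mutant_errors, candidates, max_flaky_rate):
    """Pick the candidate with the highest mutation score whose estimated
    flaky rate stays within max_flaky_rate. Ties go to the tightest bound.
    Returns (bound, flaky_rate, score), or None if no candidate qualifies.
    """
    best = None
    for bound in sorted(candidates):
        rate = estimate_flaky_rate(passing_samples, bound)
        if rate > max_flaky_rate:
            continue  # too flaky: the correct code would fail too often
        score = mutation_score(mutant_errors, bound)
        if best is None or score > best[2]:
            best = (bound, rate, score)
    return best


# Hypothetical data: errors from repeated runs of the correct code,
# and one error value per mutant (an injected fault).
passing_samples = [0.09, 0.10, 0.11, 0.10, 0.095, 0.105, 0.10, 0.098, 0.102, 0.10]
mutant_errors = [0.09, 0.12, 0.15, 0.18, 0.25, 0.40]
candidates = [0.11, 0.13, 0.15, 0.20, 0.30, 0.50]

result = choose_bound(passing_samples, mutant_errors, candidates, max_flaky_rate=0.01)
print("chosen (bound, flaky rate, mutation score):", result)
print("mutation score at the loose bound 0.50:", mutation_score(mutant_errors, 0.50))
```

In this toy data, a loose bound such as 0.50 never fails but also catches none of the hypothetical mutants, while the tightest candidate is rejected because the correct code would exceed it too often. A real analysis would need many more samples and a distribution check before trusting the fitted tail.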

Graduation Semester

2024-12

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/127399

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
