A conceptual model for transparent, reusable, and collaborative data cleaning
Parulian, Nikolaus Nova
Permalink
https://hdl.handle.net/2142/121511
Description
Title
A conceptual model for transparent, reusable, and collaborative data cleaning
Author(s)
Parulian, Nikolaus Nova
Issue Date
2023-07-13
Director of Research (if dissertation) or Advisor (if thesis)
Ludäscher, Bertram
Doctoral Committee Chair(s)
Ludäscher, Bertram
Committee Member(s)
Downie, John Stephen
Diesner, Jana
Bosch, Nigel
Department of Study
Information Sciences
Discipline
Information Sciences
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Data Cleaning
Data Quality
Provenance
Data Preparation
Machine Learning
Artificial Intelligence
Workflows
Metadata
Visualization
Language
eng
Abstract
Data cleaning is an essential component of data preparation in machine learning and other data science workflows. It is a time-consuming and error-prone task that can greatly affect the reliability of subsequent analyses. To make data-cleaning processes transparent and auditable, tools must capture provenance information. However, existing provenance models are limited in their ability to trace and query changes at different levels of granularity. To address this, we propose a new conceptual model that captures fine-grained retrospective provenance and extends it with prospective provenance to represent the operations or workflows that change the datasets. This hybrid model allows powerful queries and supports advanced use cases such as auditing data-cleaning workflows. We further extend the model to focus on reusability and collaboration in data cleaning. The extended model addresses scenarios in which multiple users contribute to dataset changes; it enables tracking of curator actions, identification of dependencies between cleaning operations, and collaboration among curators. Through an experimental case study, we demonstrate the reusability of data-cleaning workflows, the contributions of different users, and the effectiveness of collaboration in improving data quality.
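As a rough illustration of the hybrid idea described in the abstract, the following minimal Python sketch records both the operations applied to a table (prospective provenance) and the individual cell changes each operation caused (retrospective provenance). It is not the dissertation's model: the class names `Operation`, `CellChange`, and `ProvenanceLog`, the field names, and the example data are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Operation:
    """Prospective provenance: one step of a cleaning workflow (hypothetical schema)."""
    op_id: int
    name: str
    column: str
    user: str


@dataclass
class CellChange:
    """Retrospective provenance: one cell value changed by one operation."""
    op_id: int
    row: int
    column: str
    old: str
    new: str


@dataclass
class ProvenanceLog:
    operations: list = field(default_factory=list)
    changes: list = field(default_factory=list)

    def apply(self, table: list, name: str, column: str,
              func: Callable[[str], str], user: str) -> Operation:
        """Apply func to every cell in column, logging the step and each changed cell."""
        op = Operation(len(self.operations) + 1, name, column, user)
        self.operations.append(op)
        for row, record in enumerate(table):
            old = record[column]
            new = func(old)
            if new != old:  # only actual changes are recorded
                record[column] = new
                self.changes.append(CellChange(op.op_id, row, column, old, new))
        return op

    def cell_history(self, row: int, column: str) -> list:
        """Fine-grained query: which steps, by which users, changed this cell?"""
        ops = {op.op_id: op for op in self.operations}
        return [(ops[c.op_id].name, ops[c.op_id].user, c.old, c.new)
                for c in self.changes if c.row == row and c.column == column]

    def changes_by_user(self, user: str) -> int:
        """Collaboration query: how many cell changes did this user contribute?"""
        ids = {op.op_id for op in self.operations if op.user == user}
        return sum(1 for c in self.changes if c.op_id in ids)


# Illustrative data only; two hypothetical curators clean the same column.
table = [{"city": " chicago "}, {"city": "Urbana"}]
log = ProvenanceLog()
log.apply(table, "trim", "city", str.strip, user="alice")
log.apply(table, "titlecase", "city", str.title, user="bob")
history = log.cell_history(0, "city")
```

Because each cell change carries the identifier of the operation that caused it, a query can move from a single cell back to the workflow step and curator responsible. This linkage is the kind of auditing and contribution tracking the abstract mentions, shown here only in a simplified form.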