Files in this item



BINDSCHADLER-DISSERTATION-2018.pdf (application/pdf, 3MB)


Title: Privacy-preserving seed-based data synthesis
Author(s): Bindschadler, Vincent
Director of Research: Gunter, Carl A.
Doctoral Committee Chair(s): Gunter, Carl A.
Doctoral Committee Member(s): Zhai, ChengXiang; Borisov, Nikita; Smith, Adam D
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Data Privacy; Synthetic Data
Abstract: How can we share sensitive datasets in such a way as to maximize utility while simultaneously safeguarding privacy? This thesis proposes an answer to this question by re-framing the problem of sharing sensitive datasets as a data synthesis task. Specifically, we propose a framework to synthesize full data records in a privacy-preserving way so that they can be shared instead of the original sensitive data. The core of the framework is a technique called seed-based data synthesis. Seed-based data synthesis produces data records by conditioning the output of a generative model on some input data record called the seed. This technique produces synthetic records that are similar to their seeds, which results in high-quality outputs. But it simultaneously introduces statistical dependence between synthetic records and their seeds, which may compromise privacy. As a countermeasure, we introduce a new class of techniques that can achieve strong privacy notions in this setting: privacy tests. Privacy tests are algorithms that probabilistically reject candidate synthetic records which are determined to leak sensitive information. Synthetic records that fail the test are simply discarded, whereas those that pass the test are deemed safe and included in the synthetic dataset to be shared. We design two privacy tests that provably yield differential privacy. We analyze the quality of synthetic datasets based on a cryptography-inspired definition of distinguishability: if synthetic data records are indistinguishable from real records, then they are (by definition) as useful as real data. On the theory front, we characterize the utility-privacy trade-off of seed-based data synthesis. On the experimental front, we design an efficient procedure to experimentally quantify distinguishability. We experimentally validate the seed-based data synthesis framework using five probabilistic generative models. Specifically, using real-world datasets as input, we produce synthetic data records for four different application scenarios and data types: location trajectories, census microdata, medical data, and facial images. We evaluate the quality of the produced synthetic records using both application-dependent utility metrics and distinguishability, and show that the framework is capable of producing highly realistic synthetic data records while providing differential privacy for conservative parameters.
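Illustrative sketch (not part of the deposited record): the abstract describes a generate-then-test loop without specifying the two privacy tests. The Python below is a minimal toy version of that loop under stated assumptions. The bit-vector generator, the "count the plausible alternative seeds and compare against a noisy threshold" test, and all names and parameters (`generate`, `gen_prob`, `privacy_test`, `synthesize`, `flip_prob`, `k`, `gamma`, `eps0`) are hypothetical choices made for illustration. They are not taken from the thesis, and the sketch is not claimed to satisfy differential privacy as written.

```python
import random


def generate(seed_record, flip_prob, rng):
    """Toy seed-conditioned generator (assumption): keep each bit of the
    seed, or with probability flip_prob resample it uniformly."""
    return tuple(
        b if rng.random() >= flip_prob else rng.randint(0, 1)
        for b in seed_record
    )


def gen_prob(candidate, seed_record, flip_prob):
    """Probability that generate() outputs `candidate` from `seed_record`."""
    p = 1.0
    for y, d in zip(candidate, seed_record):
        # A bit matches if it was kept, or resampled to the same value.
        p *= (1 - flip_prob / 2) if y == d else (flip_prob / 2)
    return p


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def privacy_test(candidate, seed_record, dataset, flip_prob, k, gamma, eps0, rng):
    """Hypothetical randomized test: pass only if enough records in the
    dataset could have produced the candidate with a probability within a
    factor gamma of the true seed's probability."""
    p_seed = gen_prob(candidate, seed_record, flip_prob)
    plausible = sum(
        1
        for d in dataset
        if p_seed / gamma <= gen_prob(candidate, d, flip_prob) <= p_seed * gamma
    )
    noisy_threshold = k + laplace(1 / eps0, rng)
    return plausible >= noisy_threshold


def synthesize(dataset, n_candidates, flip_prob=0.4, k=3, gamma=2.0, eps0=1.0, rng=None):
    """Generate candidates from random seeds; release only those that pass."""
    rng = rng or random.Random(0)
    released = []
    for _ in range(n_candidates):
        seed_record = rng.choice(dataset)
        candidate = generate(seed_record, flip_prob, rng)
        if privacy_test(candidate, seed_record, dataset, flip_prob, k, gamma, eps0, rng):
            released.append(candidate)  # deemed safe under this toy test
        # Candidates that fail the test are simply discarded.
    return released


if __name__ == "__main__":
    toy_data = [(0, 0, 1, 1), (0, 1, 1, 1), (1, 0, 1, 1), (0, 0, 0, 1), (0, 0, 1, 0)]
    synthetic = synthesize(toy_data, n_candidates=20)
    print(len(synthetic), "of 20 candidates released")
```

In this toy setup, raising `k` or lowering `gamma` makes the test stricter, so fewer candidates are released. This mirrors, in a very simplified form, the utility-privacy trade-off the abstract refers to.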
Issue Date: 2018-07-02
Rights Information: Copyright 2018 Vincent Bindschadler
Date Available in IDEALS: 2018-09-27
Date Deposited: 2018-08
