Files in this item

File:LONG-DISSERTATION-2020.pdf (3MB)
Description:(no description provided)
Format:application/pdf

Description

Title:Understanding and mitigating privacy risk in machine learning systems
Author(s):Long, Yunhui
Director of Research:Gunter, Carl A
Doctoral Committee Chair(s):Gunter, Carl A
Doctoral Committee Member(s):Zhai, ChengXiang; Li, Bo; Shokri, Reza
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Degree:Ph.D.
Genre:Dissertation
Subject(s):privacy
machine learning
Abstract:Recent years have witnessed rapid development in machine learning systems and a widespread increase in machine learning applications. With this widespread adoption, however, privacy issues have emerged. This thesis studies the privacy risk in modern machine learning systems in two ways. First, we improve the understanding of machine learning privacy through attacks and measurements. Due to the increasing complexity and lack of transparency of state-of-the-art machine learning models, it is challenging to understand what information a model learns from its training data and whether that information could be leaked through the model’s predictions. Therefore, we design various attacks to infer different kinds of information from machine learning models trained on sensitive data. By analyzing the performance of these attacks, we gain a better understanding of the privacy risk of sharing these models. Second, we propose protection mechanisms at different levels to balance privacy and data utility. We divide the use of sensitive data in a modern machine learning system into three levels based on the trade-off between data utility and privacy protection. At the first level, we consider data with high utility requirements and relatively low privacy protection, such as system logs containing heterogeneous, high-dimensional data. This type of data is very sensitive to noise injection, making it challenging to achieve a strong privacy guarantee without incurring a great loss of data utility. To address this problem, we propose empirical protections based on hypothesis tests. Our approach uses various hypothesis tests to identify potential information leakage from the data and adds the minimum amount of noise sufficient to mitigate the identified risks. Although this approach does not provide strong theoretical guarantees, it allows users to share their data with higher confidence and minimal utility loss.
At the second level, we consider sensitive data that need to be shared for general purposes. For example, datasets containing personal photos can be used in a wide range of applications, including face recognition, human pose extraction, and mood detection. However, these photos are also extremely sensitive because they contain a great deal of private information. For this type of data, it is important to maintain a proper balance between privacy and data utility. On the one hand, given the sensitive nature of the data, it is necessary to apply rigorous privacy protections such as differential privacy. On the other hand, to allow multiple applications to use the released data, the privacy protection mechanisms need to preserve the original data distribution to the maximum extent possible. Based on these requirements, we design G-PATE, a novel approach for training a scalable differentially private data generator that can produce synthetic datasets with strong privacy guarantees while preserving high data utility. At the third level, we consider sensitive data that are useful for specific applications. For this type of data, it is often unnecessary to share the original dataset. Instead, data owners can share differentially private machine learning models tailored to the needs of those applications. By sharing only the models, we limit the use of the sensitive data to the approved applications while improving model utility under the same privacy guarantee. As an example, this thesis proposes the first differentially private graph convolutional network (DP-GCN). By guaranteeing edge differential privacy, DP-GCN allows users to analyze graph-structured data without leaking sensitive connection information, such as private real-life connections in social networks.
Issue Date:2020-05-04
Type:Thesis
URI:http://hdl.handle.net/2142/107972
Rights Information:Copyright 2020 Yunhui Long
Date Available in IDEALS:2020-08-26
Date Deposited:2020-05
