## Files in this item

FilesDescriptionFormat

application/pdf

SU-DISSERTATION-2021.pdf (2MB)
(no description provided)PDF

## Description

 Title: Attacks on medical data and centralized social platforms: PU learning and graphical inference Author(s): Su, Du Director of Research: Lu, Yi Doctoral Committee Chair(s): Lu, Yi Doctoral Committee Member(s): Hajek, Bruce; Srikant, Rayadurgam; Viswanath, Pramod Department / Program: Electrical & Computer Eng Discipline: Electrical & Computer Engr Degree Granting Institution: University of Illinois at Urbana-Champaign Degree: Ph.D. Genre: Dissertation Subject(s): privacy social network data-mining density evolution neural networks Abstract: Sensitive information in user data, such as health status, financial history, and personal preference is facing a high risk of being exposed to adversarial entities, as well as being abused by the service provider. Prohibiting individual-level operations on user records is one natural proposal to protect the sensitive information of user data. Instead, only aggregate statistics of sensitive data are allowed to be stored or released. It is widely believed that aggregation is an effective way of preserving the typical characteristics of users for data mining tasks while preventing the infringement of individual privacy, as the aggregated statistics carry little information of each individual user. However, these privacy-preserving techniques may fail to eliminate the risks of sensitive data leakage. Aggregation is not able to fully hide features of the dataset, and adversarial entities are able to exploit the features of aggregate data to recover the sensitive information in user records. Moreover, social platforms with a high degree of centralization, such as Tiktok, are able to design special aggregation algorithm, in a way that the individual information can be recovered completely from aggregated statistics, allowing them to exploit user preferences. In this thesis, we study the two aforementioned scenarios where aggregation fails to protect individual user's sensitive data. First, we show the hazard of re-identification of sensitive class labels caused by revealing a noisy sample mean of one class. With a novel formulation of the re-identification attack as a generalized positive-unlabeled learning problem, we prove that the risk function of the re-identification problem is closely related to that of learning with complete data. We demonstrate that with a one-sided noisy sample mean, an effective re-identification attack can be devised with existing PU learning algorithms. We then propose a novel algorithm, growPU, that exploits the unique property of sample mean and consistently outperforms existing PU learning algorithms on the re-identification task. GrowPU achieves re-identification accuracy of $93.6\%$ on the MNIST dataset and $88.1\%$ on an online behavioral dataset with noiseless sample mean. With noise that guarantees $0.01$-differential privacy, growPU achieves $91.9\%$ on the MNIST dataset and $84.6\%$ on the online behavioral dataset. In the second part of the thesis, we present a randomized article-push algorithm and a message-passing reconstruction algorithm, such that social media platforms are able to infer user preferences from only the publicly available aggregate data of article-reads, without storing any individual users' actions. Its $O(n)$ complexity allows the reconstruction algorithm to scale to a large population, as is typical of social media platforms. Moreover, the feasibility of the privacy attack depends on the algorithm using as few articles as possible. We determine the minimum number of articles needed for high probability inference. Given the proportion of users, $0<\eps<1$, who prefer a given topic, the push algorithm and reconstruction algorithm can achieve an article-to-user ratio $\beta=\sqrt{\eps(1-\eps)}$, at which phase transition occurs. By formulating the inference problem as a compressed sensing problem, we show that our phase transition threshold $\sqrt{\eps(1-\eps)}$ is extremely close to that of compressed sensing, even when the latter algorithm is of a worst-case $O(n^3)$ complexity and uses a dense Gaussian measurement matrix. Issue Date: 2021-06-15 Type: Thesis URI: http://hdl.handle.net/2142/113122 Rights Information: Copyright 2021 Du Su Date Available in IDEALS: 2022-01-12 Date Deposited: 2021-08
﻿