Files in this item



Dong_Wang.pdf (PDF, application/pdf, 4MB)


Title: On quantifying the quality of information in social sensing
Author(s): Wang, Dong
Director of Research: Abdelzaher, Tarek F.
Doctoral Committee Chair(s): Abdelzaher, Tarek F.
Doctoral Committee Member(s): Han, Jiawei; Huang, Thomas S.; Aggarwal, Charu C.
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): Social Sensing; Truth Discovery; QoI Quantification; Maximum Likelihood Estimation; Cramer-Rao Lower Bound; Apollo Fact-finding
Abstract: This thesis develops the fundamental theory and methodology for quantifying the Quality of Information (QoI) in social sensing. We use the term social sensing for sensing applications in which humans play a critical role in the sensing or data collection process. Social sensing has emerged as a new paradigm for sensory data collection, motivated by the proliferation of mobile platforms equipped with a variety of sensors (e.g., GPS, camera, microphone, and motion sensors) in the possession of common individuals, networking capabilities that enable fast and convenient data sharing (e.g., WiFi and 4G), and large-scale dissemination opportunities (e.g., Twitter and Flickr). A significant challenge in social sensing applications lies in ascertaining the correctness of collected data and the reliability of information sources. We call this challenge QoI quantification in social sensing. Unlike well-calibrated and well-tested infrastructure sensors, humans are less reliable. The term participant (or source) reliability denotes the probability that the participant reports correct observations. Reliability may be impaired by poor sensor quality, lack of sensor calibration, lack of (human) attention to the task, or even intent to deceive. Moreover, data collection is often open to a large population, where it is impossible to screen all participants beforehand. The likelihood that a participant's measurements are correct is usually unknown a priori. Consequently, it is very challenging to ascertain the correctness of data collected from unreliable sources with unknown reliability. Meanwhile, it is also challenging to ascertain the reliability of each information source without knowing whether the data it collected are true.
Therefore, the main questions posed in this thesis are: i) can we determine, in an optimal way, given only the measurements collected and without knowing the reliability of sources, which of the reported observations are true and which are not? ii) is a given source (participant) reliable or not? iii) how can we quantify the answers to the above questions? The thesis answers these questions by applying key insights from estimation theory and data fusion to develop new theories that accurately quantify both participant reliability and the correctness of participants' observations in social sensing applications. In contrast to the large body of literature in data mining and machine learning that uses various heuristics, whose inspiration can be traced back to Google's PageRank, to solve a similar trust analysis problem in information networks, our approach provides the first optimal solution to the QoI quantification problem in social sensing by casting it as an expectation maximization (EM) problem, and it quantifies the estimation confidence using the Cramer-Rao lower bound (CRLB) from estimation theory. More specifically, this thesis addresses the QoI quantification challenge of social sensing from the following perspectives. First, we developed an analytically founded Bayesian interpretation of the basic fact-finding scheme that is popularly used in the data-mining literature to rank both sources and their asserted information based on credibility values. Our method offers the first probability-based semantics for interpreting the credibility results output by fact-finders. It leads to a direct quantification of both participant reliability and the correctness of the observations they asserted. The Bayesian interpretation is an approximation scheme based on the linearity assumption made by the basic fact-finders, which motivates our further efforts to find the optimal solution to the QoI quantification problem in social sensing.
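To make the idea of a basic fact-finder concrete, the following is a minimal sketch of a linear, iterative ranking scheme in the spirit of hubs-and-authorities style algorithms. It is an illustration only and is not taken from the thesis: the function name `basic_fact_finder`, the source names, and the example claims are all hypothetical. Note that the scores it produces are relative rankings without a probabilistic meaning, which is the gap the Bayesian interpretation described above is meant to address.

```python
def basic_fact_finder(claims_by_source, iterations=20):
    """Rank sources and claims with a simple linear iteration (illustrative).

    claims_by_source maps a hypothetical source name to the set of claims
    that source asserted. Returned scores are rankings, not probabilities.
    """
    sources = list(claims_by_source)
    claims = sorted({c for asserted in claims_by_source.values() for c in asserted})
    source_score = {s: 1.0 for s in sources}
    claim_score = {c: 1.0 for c in claims}

    def normalize(scores):
        # Scale so the largest score is 1.0, keeping the iteration bounded.
        top = max(scores.values())
        return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}

    for _ in range(iterations):
        # A claim's credibility is the sum of the scores of sources asserting it.
        claim_score = normalize({
            c: sum(source_score[s] for s in sources if c in claims_by_source[s])
            for c in claims
        })
        # A source's credibility is the sum of the scores of its claims.
        source_score = normalize({
            s: sum(claim_score[c] for c in claims_by_source[s])
            for s in sources
        })
    return source_score, claim_score


# Hypothetical example: two sources agree on two claims, a third stands alone.
example = {
    "alice": {"bridge closed", "road flooded"},
    "bob": {"bridge closed", "road flooded"},
    "carol": {"power outage"},
}
source_score, claim_score = basic_fact_finder(example)
```

In this toy input, corroborated claims and the sources that assert them rise to the top of the ranking, while the uncorroborated claim and its source sink.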
Second, we developed a maximum likelihood estimator by casting the QoI quantification problem in social sensing as an expectation maximization problem that can be solved optimally and efficiently. The EM scheme overcomes the approximation limitations of the Bayesian interpretation and of the many heuristics used in trust analysis of information networks. It provides the first optimal solution (in the sense of maximum likelihood estimation) that jointly estimates participant reliability and the correctness of their reported measured variables in the way that is most consistent with the data collected. Third, a key metric missing from all previous fact-finding literature is a quantification of confidence in the estimation results. Without such a confidence metric, the estimation results lack the bounds needed to characterize their accuracy. Thanks to the expectation maximization formulation of the problem, we are able to exactly quantify the confidence in the maximum likelihood estimation of the EM scheme. Specifically, we obtained both real and asymptotic confidence bounds on the participant reliability estimation based on the Cramer-Rao lower bound in estimation theory, and studied their scalability and robustness limitations in different application scenarios. Fourth, we extended the model and the maximum likelihood approach to relax some simplifying assumptions made in our original model. In particular, we extended the maximum likelihood estimator to solve the QoI quantification problem when conflicting observations exist and the measured variables are non-binary. Fifth, because the original iterative EM algorithm may not be efficient for streaming data, we proposed a recursive EM algorithm that can compute the estimation results on the fly.
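As a rough illustration of how an EM formulation can jointly estimate claim correctness and source reliability, here is a minimal sketch under a simplified binary model. The model and all names are assumptions made for this example, not the thesis's notation: each source i reports a claim with probability `a[i]` when the claim is true and `b[i]` when it is false, `d` is the prior probability that a claim is true, and reliability is then derived by Bayes' rule as the probability that a reported claim is true. It covers only the batch, binary, non-conflicting case.

```python
import math


def em_truth_discovery(reports, iterations=50, eps=1e-6):
    """EM sketch for binary claims; reports[i][j] is 1 if source i asserted claim j."""
    m, n = len(reports), len(reports[0])

    def clip(p):
        # Keep probabilities away from 0 and 1 so logarithms stay finite.
        return min(max(p, eps), 1.0 - eps)

    # Assumed initialization: sources are somewhat more likely to report true claims.
    a, b, d = [0.6] * m, [0.4] * m, 0.5
    z = [0.5] * n  # z[j] approximates P(claim j is true | reports)

    for _ in range(iterations):
        # E-step: posterior probability that each claim is true.
        for j in range(n):
            diff = math.log(d) - math.log(1.0 - d)
            for i in range(m):
                if reports[i][j]:
                    diff += math.log(a[i]) - math.log(b[i])
                else:
                    diff += math.log(1.0 - a[i]) - math.log(1.0 - b[i])
            # Numerically stable logistic function of the log-odds.
            if diff >= 0:
                z[j] = 1.0 / (1.0 + math.exp(-diff))
            else:
                z[j] = math.exp(diff) / (1.0 + math.exp(diff))

        # M-step: re-estimate parameters from the soft truth assignments.
        z_sum = sum(z)
        for i in range(m):
            hit = sum(z[j] for j in range(n) if reports[i][j])
            a[i] = clip(hit / max(z_sum, eps))
            b[i] = clip((sum(reports[i]) - hit) / max(n - z_sum, eps))
        d = clip(z_sum / n)

    # Reliability by Bayes' rule: P(claim true | source reported it).
    reliability = [a[i] * d / (a[i] * d + b[i] * (1.0 - d)) for i in range(m)]
    return z, reliability


# Hypothetical data: sources 0-2 agree on claims 0-2; source 3 alone reports claims 3-5.
reports = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
z, reliability = em_truth_discovery(reports)
```

Like any EM procedure, this sketch converges to a local optimum that depends on initialization. It does not compute the Cramer-Rao-based confidence bounds or the recursive (streaming) variant described above.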
We evaluated the performance of the recursive EM algorithm along different tradeoff dimensions, such as trustworthiness of sources, freshness of input data, and timeliness of algorithm execution. Finally, the theory developed above has been implemented and built into the core of Apollo, a data distillation service for social sensing applications. Apollo is designed to filter noisy social sensing data by using the developed theory to jointly estimate the credibility of information sources and of the observations they make, and then remove the less credible observations. We evaluated the performance of Apollo through real-world social sensing applications that report the progression of several real events (e.g., the Egypt unrest and Hurricane Irene). Results demonstrated that Apollo effectively cleans the input data and correctly identifies true and important information from a large crowd of unreliable human sources.
Issue Date: 2013-02-03
Rights Information: Copyright 2012 Dong Wang
Date Available in IDEALS: 2013-02-03
Date Deposited: 2012-12
