Files in this item



Gao_Jing.pdf (application/pdf, 2MB)


Title:Exploring the power of heterogeneous information sources
Author(s):Gao, Jing
Director of Research:Han, Jiawei
Doctoral Committee Member(s):Zhai, ChengXiang; Abdelzaher, Tarek F.; Fan, Wei
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):data mining
multiple information sources
semi-supervised learning
anomaly detection
consensus combination
inconsistency detection
transfer learning
stream classification
information networks
system debugging
Abstract:The big data challenge is a unique opportunity for both data mining and database research and engineering. A vast ocean of data is collected from trillions of connected devices in real time on a daily basis, and useful knowledge is usually buried in data of multiple genres, from different sources, in different formats, and with different types of representation. Many interesting patterns cannot be extracted from a single data collection, but have to be discovered through the integrative analysis of all available heterogeneous data sources. Although many algorithms have been developed to analyze multiple information sources, real applications continuously pose new challenges: data can be gigantic, noisy, unreliable, dynamically evolving, highly imbalanced, and heterogeneous. Meanwhile, users provide limited feedback, have growing privacy concerns, and ask for actionable knowledge. In this thesis, we propose to explore the power of multiple heterogeneous information sources in such challenging learning scenarios. There are two interesting perspectives in learning from the correlations among multiple information sources: exploring their similarities (consensus combination) or their differences (inconsistency detection). In consensus combination, we focus on the task of classification with multiple information sources. Multiple information sources for the same set of objects can provide complementary predictive power, and by combining their expertise, prediction accuracy can be significantly improved. However, the major challenge is that it is hard to obtain sufficient and reliable labeled data for effective training, because labeling requires the effort of experienced human annotators. In some data sources, we may only have a large amount of unlabeled data. Although such unlabeled information does not directly generate label predictions, it provides useful constraints on the classification task.
Therefore, we first propose a graph-based consensus maximization framework to combine multiple supervised and unsupervised models obtained from all the available information sources. We further demonstrate the benefits of combining multiple models in two specific learning scenarios. In transfer learning, we propose an effective model combination framework to transfer knowledge from multiple sources to a target domain with no labeled data. We also demonstrate the robustness of model combination on dynamically evolving data. On the other hand, when unexpected disagreement is encountered across diverse information sources, this may raise a red flag and require in-depth investigation. Another line of this thesis research explores differences among multiple information sources to find anomalies. We first propose a spectral method to detect objects that behave inconsistently across multiple heterogeneous information sources as a new type of anomaly. Traditional anomaly detection methods discover anomalies based on the degree of deviation from normal objects in one data source, whereas the proposed approach detects anomalies according to the degree of inconsistency across multiple sources. The principle of inconsistency detection can benefit many applications, and in particular, we show how this principle can help identify anomalies in information networks and distributed systems. We propose probabilistic models to detect anomalies in a social community by comparing link and node information, and to detect system problems from connected machines in a distributed system by modeling correlations among multiple machines. In this thesis, we go beyond the scope of traditional ensemble learning to address challenges faced by many applications with multiple data sources.
With the proposed consensus combination framework, labeled data are no longer a requirement for successful multi-source classification; instead, the use of existing labeling experts is maximized by integrating knowledge from relevant domains and unlabeled information sources. The proposed concept of inconsistency detection across multiple data sources opens up a new direction in anomaly detection. The detected anomalies, which cannot be found by traditional anomaly detection techniques, provide new insights into the application area. The algorithms we developed have proven useful in many areas, including social network analysis, cyber-security, and business intelligence, and have the potential to be applied to many other areas, such as healthcare, bioinformatics, and energy efficiency. As both the amount of data and the number of sources in our world continue to explode, there remain great opportunities as well as numerous research challenges in inferring actionable knowledge from multiple heterogeneous sources of massive data collections.
Issue Date:2012-02-06
Genre:Dissertation / Thesis
Rights Information:Copyright 2011 Jing Gao
Date Available in IDEALS:2012-02-06
Date Deposited:2011-12
