Files in this item



application/pdfManish_Gupta.pdf (5MB)
(no description provided)PDF


Title:Outlier detection for information networks
Author(s):Gupta, Manish
Director of Research:Han, Jiawei
Doctoral Committee Chair(s):Han, Jiawei
Doctoral Committee Member(s):Zhai, ChengXiang; Abdelzaher, Tarek F.; Aggarwal, Charu C.
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):outlier detection
Community Distribution Outliers (CDOutliers)
Evolutionary Community Outliers (ECOutliers)
data mining
outlier detection for graphs
outlier detection for networks
graph query processing
community detection
community outliers
anomaly detection
evolution in networks
Abstract:The study of networks has emerged in diverse disciplines as a means of analyzing complex relationship data. There has been a significant amount of work in network science which studies properties of networks, querying over networks, link analysis, influence propagation, network optimization, and many other forms of network analysis. Only recently has there been some work in the area of outlier detection for information network data. Outlier (or anomaly) detection is a very broad field and has been studied in the context of a large number of application domains. Many algorithms have been proposed for outlier detection in high-dimensional data, uncertain data, stream data and time series data. By its inherent nature, network data provides very different challenges that need to be addressed in a special way. Network data is gigantic, contains nodes of different types, rich nodes with associated attribute data, noisy attribute data, noisy link data, and is dynamically evolving in multiple ways. This thesis focuses on outlier detection for such networks with respect to two interesting perspectives: (1) community based outliers and (2) query based outliers. For community based outliers, we discuss the problem in both static as well as dynamic settings. Usually objects in a network form multiple communities. Most of the objects follow some popular community distribution patterns and also follow similar patterns of evolution. In a static network setting, one may find some objects each of whose community distribution does not follow any of the popular community distribution patterns. We refer to such outliers as Community Distribution Outliers (CDOutliers). The major challenge lies in extracting community patterns for various object types in an integrated way. We follow an iterative two-stage approach to identify CDOutliers, which performs pattern discovery (based on a joint non-negative matrix factorization approach) and outlier detection in a tightly integrated manner. In the dynamic setting, there are some objects which evolve in a very different way relative to other community members, and we define such objects as temporal community outliers. One of our studies is related to finding such outliers given two snapshots of a network (Evolutionary Community Outliers (ECOutliers)) while the other study is more general and focuses on a setting of multiple network snapshots (Community Trend Outliers (CTOutliers)). In both the studies, temporal patterns are discovered in an outlier-aware manner, and then outliers are discovered based on such patterns. The major challenge lies in performing cluster matching across snapshots so as to obtain temporal patterns. Another challenge is to define the outlier score once the patterns have been discovered. We propose algorithms and demonstrate the effectiveness of our algorithms in finding such outliers using both synthetic and real datasets. The other important aspect of my research is query based outlier detection. Given a heterogeneous network and a user subgraph query, the aim is to allow the user to find top-K ranked matches for the query in the network. Matches are ranked based on the rare and surprising associations (expressed as edges) within them. My work focuses on outlier detection with respect to two main kinds of queries: clique queries and general subgraph queries. This work can be useful for many critical applications like finding the most impacted subgraphs in computer networks during network attacks, finding the most critical locations to send extra aid on battlefields, etc. For the clique queries, we propose a low-cost shared neighbors index to assist subgraph matching and then propose a stricter version of negative association strength as a measure of outlierness of an association. The cliques are ranked based on the sum of outlierness of their edges and the top few are returned as Association Based Clique Outliers (ABCOutliers). For general subgraph queries, we discover top-K outliers by performing matching and outlier scoring in an integrated way. We pre-compute a graph topology index and a graph maximum metapath weight index. We exploit these data structures to propose a novel top-K algorithm for fast computation of subgraph outliers. The proposed concept of outlier detection from networks opens up a new direction of outlier detection research. The detected outliers, which cannot be found by traditional outlier detection techniques, provide new insights into the application area. The algorithms we developed can be applied to many areas, including social network analysis, cyber-security, distributed systems, health care, and bio-informatics. As both the amount of data as well as the linkage increase in a variety of domains, such network-based techniques will find more applications and more opportunities for research for various settings.
Issue Date:2013-05-28
Rights Information:Copyright 2013 Manish Gupta
Date Available in IDEALS:2013-05-28
Date Deposited:2013-05

This item appears in the following Collection(s)

Item Statistics