Files in this item



application/pdfMining, Indexin ... Large Graph Data Sets.pdf (1MB)
(no description provided)PDF


Title:Mining, Indexing and Similarity Search in Large Graph Data Sets
Author(s):Yan, Xifeng
Subject(s):graph mining
data mining
database management
Abstract:Scalable graph mining and graph database management tools become increasingly crucial to applications with complex data in domains ranging from software engineering to computational biology. Due to their high complexity, it is often difficult, if not impossible, for human beings to manually analyze any reasonably large collection of graphs. In this dissertation, we investigate two fundamental problems in large scale graph data mining and graph database management: Given a graph data set, what are the hidden structural patterns and how can we find them? and how can we index graphs and perform similarity search in large graph data sets? Graph mining and graph data management themselves are expensive computational problems since subgraph isomorphism is NP-complete. Existing Apriori-based mining solutions face inevitable overheads when they join two existing graph patterns to form larger candidates. We develop a graph canonical labeling system, gSpan, showing both theoretically and empirically that the join operation can be avoided. Graph indexing, the second problem addressed in this dissertation, may incur an exponential number of index entries if all of the substructures in a graph database are used for indexing. The solution, gIndex, proposes a pattern-based index, which is built on frequent and discriminative graph patterns discovered through a mining process. This mining-based indexing methodology leads to the development of a compact but effective graph index structure that is orders of magnitude smaller in size but an order of magnitude faster in performance than traditional approaches. Besides graph mining and search, this dissertation provides thorough investigation of pattern summarization, pattern-based classification, constraint pattern mining, and graph similarity searching, which could leverage the mining of graph patterns. It also explores several critical applications in bioinformatics, computer systems, and software engineering, including gene relevance network analysis for functional annotation, and program flow analysis for automated software bug isolation. The developed concepts, theories, and systems hence increase our understanding of data mining principles in structural pattern discovery, interpretation and search. The formulation of a general graph information system through this study could provide fundamental supports to graph-intensive applications.
Issue Date:2006-12
Genre:Technical Report
Other Identifier(s):UIUCDCS-R-2006-2741
Rights Information:You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS:2009-04-21

This item appears in the following Collection(s)

Item Statistics