Files in this item



application/pdfTowards Accurat ... Pattern-Based Approach.pdf (2MB)
(no description provided)PDF


Title:Towards Accurate and Efficient Classification: A Discriminative and Frequent Pattern-Based Approach
Author(s):Cheng, Hong
Subject(s):machine learning
data mining
Abstract:Classification is a core method widely studied in machine learning, statistics, and data mining. A lot of classification methods have been proposed in literature, such as Support Vector Machines, Decision Trees, and Bayesian Networks, most of which assume that the input data is in a feature vector representation. However, in some classification problems, the predefined feature space is not discriminative enough to distinguish between different classes. More seriously, in many other applications, the input data has very complex structures, but with no initial feature vector representation, such as transaction data (e.g., customer shopping transactions), sequences (e.g., protein sequences and software execution traces), graphs (e.g., chemical compounds and molecules, social and biological networks), semi-structured data (e.g., XML documents), and text data. For both scenarios, a primary question is how to construct a discriminative and compact feature set, on the basis of which, classification could be performed to achieve good classification performance. Although a lot of kernel-based approaches have been proposed to transform the feature space and, as a way to measure the similarity between two data objects, the implicit definition of feature space makes the kernel-based approach hard to interpret, and the high computational complexity makes it hard to scale to large problem sizes. A concrete example of complex structural data classification is classifying chemical compounds to various classes (e.g., toxic vs. nontoxic, active vs. inactive), where a key challenge is how to construct discriminative graph features. While simple features such as atoms and links are too simple to preserve the structural information, graph kernel methods make it hard to interpret the classifiers. In this dissertation, I proposed to use frequent patterns as higher-order and discriminative features to characterize data, especially complex structural data, and thus enhance the classification power. Towards this goal, I designed a framework of discriminative frequent pattern-based classification which has been shown to improve the classification performance significantly. Theoretical analysis is provided to reveal the association between a feature's frequency and its discriminative power, thus demonstrate that frequent pattern is a good candidate as discriminative feature. Due to the explosive nature of frequent pattern mining, the frequent pattern-based feature construction could be a computational bottleneck, if the whole set of frequent patterns w.r.t. a minimum support threshold are generated. To overcome this computational bottleneck, I proposed two solutions: DDPMine and LEAP which directly mine the most discriminative features without generating the complete set. Both methods have been shown to improve efficiency while maintaining the classification accuracy. I further applied the discriminative frequent pattern-based classification to classifying chemical compounds with very skewed class distribution, which poses challenges for both feature construction and model learning. An ensemble framework which includes the ensembles in both the data space and the feature space is proposed to handle the challenges and shown to achieve good classification performance. In conclusion, the framework of discriminative frequent pattern-based classification could lead to a highly accurate, efficient and interpretable classifier on complex data. The pattern-based classification technique would have great impact in a wide range of applications including text categorization, chemical compound classification, software behavior analysis and so on.
Issue Date:2008-05
Genre:Technical Report
Other Identifier(s):UIUCDCS-R-2008-2959
Rights Information:You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS:2009-04-22

This item appears in the following Collection(s)

Item Statistics