|Abstract:||Data is everywhere, abundant, continuously increasing, and heterogeneous. For example, Web pages on the Internet are rapidly growing, data in commercial databases are continuously accumulating, as well as biological data (e.g., microarray data), scientific data (e.g., astronomical data), and text data (e.g., news articles and science journals). Extracting useful information or knowledge from such massive data is important but hard: rich data but poor information is a common phenomenon in the real world. Data mining refers to extracting or mining useful knowledge from large amounts of data. Data mining includes the tasks of data classification, clustering or outlier analysis, concept description or visualization, association analysis, and evolution or trend analysis.
Many statistical and machine learning methods have been employed for data mining sub-tasks. Based on a strong mathematical foundation, SVMs (Support Vector Machines) have been one of the most actively developed classification and regression methodologies due to their salient properties such as margin maximization and systematic nonlinear classification via kernel tricks. However, SVMs have not been as favored for large-scale data mining because of several challenges:
1. Scalability: SVMs are unscalable to data size while common data mining applications often involve millions or billions of data objects.
2. Applicability: SVMs are limited to (semi-)supervised learning which is mostly applied to binary classification problems.
3. Interpretability: It is hard to interpret and extract knowledge from SVM models.
In this thesis, these challenges are identified and the SVM technologies are advanced to the practice of data mining with the proposal of several practical data mining solutions as follows.
1. A scalable SVM classification method, called CB-SVM (Clustering-Based SVM), is proposed. CB-SVM effectively approximates an SVM classification function in a single scan of data given a fixed size of memory. CB-SVM applies a hierarchical micro-clustering method that provides an SVM with high quality samples that carry the statistical summaries of the data such that the summaries maximize the benefit of learning the SVM.
2. Two single-class classification (SCC) methods are developed using SVMs. SCC seeks to distinguish one class of data from the universal set of multiple classes. (We call the target class "positive" and the complement set of samples "negative." In SCC problems, it is assumed that a reasonable sample of the negative data is not available. Since it is not natural to collect the ``non-interesting'' objects (i.e., negative data) to train the concept of the ``interesting'' objects (i.e., positive data), SCC problems are prevalent in real world where positive and unlabeled data are widely available but negative data are hard or expensive to acquire. Two SCC algorithms - Mapping Convergence (MC) and Support Vector Mapping Convergence (SVMC) - are proposed that compute an accurate boundary of the target class from positive and unlabeled data (without labeled negative data).
3. Finally, SVM is employed to identify compact and highly discriminative feature combinations for a specified class of data by directly analyzing the SVM model of polynomial kernel. Nowadays in bioinformatics, high throughput techniques such as microarray experiments make it feasible to examine and collect massive data at the molecular level. These data, typically mapped to a very high dimensional feature space, carry rich information about functionalities of certain biological entities and can be used to infer valuable knowledge for the purposes of classification and prediction. Typically, a small number of features or feature combinations play determinant roles in functional discrimination. The identification of such features or feature combinations is of great importance. A method is presented that employs SVM to discover compact and highly discriminative features or feature combinations from a rich feature collection.
Our results show that, by appreciating the SVM's prominent properties, the proposed methods are shown better than other existing methods in many real-world data mining problems.