|Abstract:||In many real-world applications of data mining, datasets can be represented using matrices, where rows of the matrix correspond to objects (or data instances) and columns to features (or attributes). Often the datasets are in high-dimensional feature space. For example, in the vector space model of text data, the feature dimension is the vocabulary size. If representing a social network using an adjacency matrix, the feature dimension corresponds to the number of objects in the network. Many other datasets also fall into this category, such as genetic datasets, images, and medical datasets. Even though the feature dimension is enormous, a common observation is that the high-dimensional datasets may (approximately) lie in a subspace of smaller dimensionality, due to dependency or correlation among features. This thesis studies the problem of automatically identifying the low-dimensional space that high-dimensional datasets (approximately) lie in based on dimension reduction models: one is low-rank estimation models and the other is embedding learning models. For data matrices, low-rank estimation is to recover an underlying data matrix, subject to the constraint the matrix is of reduced rank. Such analysis is also generalized to the high-dimensional higher-order tensor data. Meanwhile, embedding learning models are to directly project the observation data into a low-dimensional vector space.
In the first part, the theoretical analysis of low-rank estimation models is established in the regime of high-dimensional statistics. For matrices, the low-rank structure corresponds to the sparsity of the singular values; while for tensors, the low-rank model can be defined as the low-rankness of the unfolding matrices of the tensor. To achieve low-rank solutions, two categories of regularization are imposed. Firstly, the problem of robust tensor decomposition with gross corruption is considered. To recover the underlying true tensor and corruption of large magnitude, structure assumptions of low-rankness and sparsity are imposed on the tensor and corruption, respectively. The Schatten-1 norm is applied as convex regularization for the low-rank structure. Secondly, the problem of matrix estimation is considered with a nonconvex penalty. Compared with convex regularization, nonconvex penalty takes advantage of the large singular values, which leads to faster statistical convergence rate and oracle property under a mild condition on the magnitude of the singular values. For both problems, efficient optimization algorithms are proposed, and extensive numerical experiments are conducted to corroborate the efficacy of the proposed algorithms and the theoretical analysis.
In the second part, embedding learning models for real-world applications are presented. The high-dimensional data is projected into a low-dimensional vector space via preserving the proximity among objects. Each object is represented by a low-dimensional vector, called embedding or distributed representation. In the first application, the heterogeneity of the objects is considered. Based on the observation that several interactions among the strongly-typed objects happen simultaneously as an event, the embeddings of objects in each event are learned as a whole. In other words, the model preserves the proximity among all the participating objects in each event. Experimental results provide evidence that the learned embeddings are more effective while being robust to data sparsity and noises for various classification tasks. In the second application, the task of expert finding is studied, which is to rank candidates with appropriate expertise based on a given query. To capture the subtle semantic information regarding specific queries with narrow semantic meanings, locally-trained embedding learning with concept hierarchy as guidance is proposed for query expansion. The locally-trained embeddings preserve the proximity among terms constrained on a sub-corpus. Compared with global embedding trained on the whole dataset, locally-trained embedding has stronger representation power. Experimental results show that the proposed embedding learning method achieves high precision regarding the task of expert finding.
To summarize, this thesis provides important results of low-rank estimation and embedding learning models for high-dimensional data analysis and real-world applications.