|Abstract:||As a special kind of ``big data'', text data can be regarded as data reported by human sensors. Since humans are far more intelligent than physical sensors, text data contains directly useful information and knowledge about the real world, making it possible to make predictions about real-world phenomena based on text. As all application domains involve humans, text-based prediction has widespread applications, especially for optimization of decision making.
While the problem of text-based prediction resembles text classification when formulated as a supervised learning problem, it is more challenging because the variable to be predicted is generally not mentioned directly in the text data. There is thus a ``semantic gap'' between the target variable and the surface features that conventional approaches often use to represent text data. How to bridge such a gap is a key technical challenge, but it has not been well studied in existing work. In this thesis, we propose to leverage the increasingly available knowledge graphs on the Web to bridge this gap, using a knowledge graph to focus the text representation on the elements of the graph that are relevant to the prediction task. We mainly focus on a family of text-based prediction problems, entity-centric classification and regression, where the response variable can be treated as an attribute of a group of central entities.
As a form of knowledge representation, knowledge graphs have widespread applications in information retrieval, text mining, and natural language processing. Many knowledge graphs have been constructed and applied to diverse real-world applications. From the perspective of predictive analytics, a knowledge graph can enhance the interpretability of textual information and hence help discover more effective features. Despite the great success of knowledge graphs in various domains, one of the main deficiencies of many existing works is that the knowledge graph they use is pre-constructed and remains unchanged even when applied to very different application tasks. Such a static, task-independent knowledge graph, while useful, is non-optimal for any specific application, both because of the unnecessary cost of processing large amounts of irrelevant knowledge and because of its insufficient coverage of task-specific knowledge.
To address this limitation, we propose to construct a task-aware knowledge graph (TAKG), which contains only the knowledge relevant to a particular task and absorbs additional relevant knowledge from the data used in that task. We present a general formal framework for constructing a task-aware knowledge graph, develop specific algorithms for constructing one for entity-centric prediction in both knowledge-based and task-dependent ways, and apply it to a movie review categorization task.
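To make the idea of keeping only task-relevant knowledge concrete, the following is a minimal sketch and not the construction algorithm developed in the thesis. It assumes the knowledge graph is a list of (head, relation, tail) triples and that relevance can be approximated by graph distance from the entities mentioned in the task data; the function name `task_aware_subgraph` and the `max_hops` parameter are hypothetical.

```python
from collections import defaultdict, deque


def task_aware_subgraph(triples, seed_entities, max_hops=1):
    """Keep only triples within `max_hops` edges of the seed entities.

    Assumption (not from the thesis): relevance is approximated by
    hop distance in the graph, ignoring edge direction.
    """
    # Index every triple under both of its endpoints.
    touching = defaultdict(list)
    for head, relation, tail in triples:
        touching[head].append((head, relation, tail))
        touching[tail].append((head, relation, tail))

    # Breadth-first search outward from the task's central entities.
    depth = {entity: 0 for entity in seed_entities}
    queue = deque(seed_entities)
    kept = set()
    while queue:
        entity = queue.popleft()
        if depth[entity] >= max_hops:
            continue  # do not expand beyond the hop budget
        for head, relation, tail in touching.get(entity, []):
            kept.add((head, relation, tail))
            for neighbor in (head, tail):
                if neighbor not in depth:
                    depth[neighbor] = depth[entity] + 1
                    queue.append(neighbor)
    return kept
```

In a movie review setting, the seeds would be the movies under review, so that facts about unrelated entities never enter the feature pipeline. Absorbing new knowledge from the task data is a separate step that this sketch does not cover.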
We propose two methods to expand the knowledge graph. One is to discover new entities and relations with a joint embedding model that learns an embedding vector for each entity and relation. In this way, relationships between related entities can be identified at the finer granularity pre-defined by the knowledge graph. An alternative is to use more general word relations, e.g., paradigmatic and syntagmatic relations, to expand the knowledge graph by including loosely related entities. Both methods work under certain circumstances, but the former is helpful in a wider range of applications.
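The two word relations can be illustrated with a small sketch. It assumes the common reading in which paradigmatic relatedness is measured by similarity of surrounding contexts and syntagmatic relatedness by co-occurrence. The scoring functions below (cosine similarity over context counts, and sentence-level pointwise mutual information) are simple stand-ins chosen for illustration, not necessarily the measures used in the thesis, and all function names are hypothetical.

```python
from collections import Counter
from math import log2, sqrt


def context_vectors(sentences, window=1):
    """Map each word to a Counter of words seen within `window` positions."""
    contexts = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbors = tokens[lo:i] + tokens[i + 1:hi]
            contexts.setdefault(word, Counter()).update(neighbors)
    return contexts


def paradigmatic_score(contexts, a, b):
    """Cosine similarity of the two words' context count vectors."""
    va, vb = contexts.get(a, Counter()), contexts.get(b, Counter())
    dot = sum(count * vb[w] for w, count in va.items() if w in vb)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def syntagmatic_score(sentences, a, b):
    """Pointwise mutual information of a and b appearing in the same sentence."""
    n = len(sentences)
    has_a = sum(1 for s in sentences if a in s)
    has_b = sum(1 for s in sentences if b in s)
    both = sum(1 for s in sentences if a in s and b in s)
    if both == 0:
        return float("-inf")  # never observed together
    return log2(both * n / (has_a * has_b))
```

Under this reading, words that score highly against an entity already in the graph become candidates for loosely related entities. Deciding which candidates to admit is left open here.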
We also make a systematic study of knowledge graph-assisted feature engineering. We propose several different ways to construct knowledge graph-based features and investigate their performance in multiple real applications. Our study shows that different types of applications may favor different ways of constructing knowledge graph-based features. We find that the coverage of the knowledge graph is important: if it cannot provide sufficient background knowledge, the effectiveness of the knowledge graph-based features suffers. In addition, the generated knowledge graph-based features can sometimes be very noisy, especially when the correlation between the text and the response variable is weak. To distinguish the signal features from the noise, we propose a two-stage filtering method to further prune the features. Our experimental results show that the pruned knowledge graph-based features have strong predictive power, which again confirms that leveraging text data is promising for predicting real-world phenomena.
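The abstract does not say what the two filtering stages are, so the following is only a hypothetical example of what a two-stage feature filter can look like, not the method proposed in the thesis. It assumes the first stage removes features that occur in too few documents and the second keeps features whose Pearson correlation with the response exceeds a threshold. The names `two_stage_filter`, `min_support`, and `min_abs_corr` are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def two_stage_filter(features, response, min_support=2, min_abs_corr=0.3):
    """Prune a dict of feature name -> per-document values.

    Both stages are assumptions made for illustration only.
    """
    # Stage 1 (assumed): drop features that are non-zero in too few documents.
    frequent = {
        name: values
        for name, values in features.items()
        if sum(1 for v in values if v != 0) >= min_support
    }
    # Stage 2 (assumed): keep features that correlate with the response.
    return {
        name: values
        for name, values in frequent.items()
        if abs(pearson(values, response)) >= min_abs_corr
    }
```

Running a cheap frequency filter before the statistical one means fewer correlations have to be computed, and correlations are not estimated from features with almost no support.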