|Abstract:||In today's computerized and information-based society, text data is rich but often also "messy". We are inundated with vast amounts of text data, written in different genres (from grammatical news articles and scientific papers to noisy social media posts), covering topics in various domains (e.g., medical records, corporate reports, and legal acts). Can computational systems automatically identify various real-world entities mentioned in a new corpus and use them to summarize recent news events reliably? Can computational systems capture and represent different relations between biomedical entities from massive and rapidly emerging life science literature? How might computational systems represent the factual information contained in a collection of medical reports to support answering detailed queries or running data mining tasks?
While people can easily access the documents in a gigantic collection with the help of data management systems, they struggle to gain insights from such a large volume of text data: document understanding calls for in-depth content analysis, content analysis itself may require domain-specific knowledge, and over a large corpus, a complete read and analysis by domain experts will invariably be subjective, time-consuming and relatively costly. To turn such massive, unstructured text corpora into machine-readable knowledge, one of the grand challenges is to gain an understanding of the typed entity and relation structures in the corpus. This thesis focuses on developing principled and scalable methods for extracting typed entities and relationships with light human annotation effort, to overcome the barriers in dealing with text corpora of various domains, genres and languages. In addition to our effort-light methodologies, we also contribute effective, noise-robust models and real-world applications for two main problems:
- Identifying Typed Entities: We show how to perform data-driven text segmentation to recognize entities mentioned in text as well as their surrounding relational phrases, and how to infer types for entity mentions by propagating "distant supervision" (from external knowledge bases) via relational phrases. To resolve the data sparsity issue during propagation, we complement the type propagation with clustering of functionally similar relational phrases based on their redundant occurrences in a large corpus. Apart from entity recognition and coarse-grained typing, we argue that fine-grained entity typing benefits many downstream applications but is very challenging because distant supervision assigns labels without regard to context, and we present principled, efficient models and algorithms for inferring a fine-grained type path for each entity mention based on its sentence context.
- Extracting Typed Entity Relationships: We extend the idea of entity recognition and typing to extract relationships between entity mentions and infer their relation types. We show how to effectively model the noisy distant supervision for relationship extraction, and how to avoid the error propagation that usually occurs in incremental extraction pipelines by integrating the typing of entities and relationships in a principled framework. The proposed approach leverages noisy distant supervision for both entities and relationships, and simultaneously learns to uncover the most confident labels and to model the semantic similarity between true labels and text features.
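
As a rough illustration of the type-propagation idea in the first item, the toy sketch below lets entity mentions that match a knowledge base vote on the types associated with each relational phrase, then assigns unmatched mentions the majority type of the phrases they occur with. It is a simplified stand-in under stated assumptions, not the models developed in the thesis: the knowledge base `KB_TYPES`, the pairs in `MENTIONS`, and the function `propagate_types` are all hypothetical, and majority voting replaces the actual propagation and clustering.

```python
from collections import Counter, defaultdict

# Hypothetical toy knowledge base (the source of "distant supervision").
KB_TYPES = {
    "Barack Obama": "person",
    "Angela Merkel": "person",
    "Google": "organization",
}

# Hypothetical (entity mention, relational phrase) pairs, as if produced
# by a text segmentation step.
MENTIONS = [
    ("Barack Obama", "was born in"),
    ("Angela Merkel", "was born in"),
    ("Google", "acquired"),
    ("Ada Lovelace", "was born in"),  # not in the knowledge base
    ("Acme Corp", "acquired"),        # not in the knowledge base
]


def propagate_types(mentions, kb_types):
    """Assign types to unlinked mentions by majority vote over shared phrases."""
    # Step 1: each KB-linked mention votes its type onto its relational phrase.
    votes = defaultdict(Counter)
    for entity, phrase in mentions:
        if entity in kb_types:
            votes[phrase][kb_types[entity]] += 1

    # Step 2: each unlinked mention accumulates the votes of its phrases.
    scores = defaultdict(Counter)
    for entity, phrase in mentions:
        if entity not in kb_types and phrase in votes:
            scores[entity].update(votes[phrase])

    # Step 3: keep the highest-scoring type per unlinked mention.
    return {entity: c.most_common(1)[0][0] for entity, c in scores.items()}


predictions = propagate_types(MENTIONS, KB_TYPES)
print(predictions)
```

In this toy setting a phrase seen only with unlinked mentions contributes no votes, which is a small-scale version of the data sparsity issue that the clustering of similar relational phrases is meant to address.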
In practice, text data is often highly variable: corpora from different domains, genres or languages have typically required a wide range of language resources (e.g., grammars, vocabularies, and gazetteers) for effective processing. The “massive” and “messy” nature of text data poses significant challenges to creating tools for automated extraction of entity and relation structures that scale with text volume. State-of-the-art information extraction systems have relied on large amounts of task-specific labeled data (e.g., annotating terrorist attack-related entities in web forum posts written in Arabic) to construct machine-learning models (e.g., deep neural networks). However, even though domain experts can manually create high-quality training data for specific tasks as needed, both the scale and efficiency of such a manual process are limited. This thesis harnesses the power of "big text data" and focuses on creating generic solutions for efficient construction of customized machine-learning models for mining typed entities and relationships, relying on only limited amounts of (or even no) task-specific training data. The approaches developed in the thesis are thus general and applicable to all kinds of text corpora in different natural languages, enabling quick deployment of data mining applications. We provide scalable algorithmic approaches that leverage external knowledge bases as sources of supervision and exploit data redundancy in massive text corpora, and we show how to use them in large-scale, real-world applications, including structured exploration and analysis of life sciences literature, extracting document facets from technical documents, document summarization, entity attribute discovery, and open-domain information extraction.