Abstract: With the fast growth in the amount of digitized text in recent years, text information management has become increasingly important in people's daily lives. Natural language processing provides the foundation for many modern text information management technologies. For many natural language processing tasks, the state-of-the-art solutions are based on supervised statistical machine learning methods, which require large manually annotated corpora. However, the variation of text across domains in vocabulary, format, style, and so on, together with the large amount of human effort needed to create labeled training data, makes it practically infeasible to directly apply supervised machine learning methods to natural language processing tasks in new domains. There is therefore a great need for special learning algorithms and techniques that adapt classifiers trained on an old domain to a different but related new domain.
This thesis aims to understand the domain adaptation problem and to develop general learning techniques for solving it. To understand domain adaptation, we conduct a formal analysis from several perspectives. First, we look at the intrinsic distributional difference between two domains, which leads to an instance weighting solution to domain adaptation. Second, we look at the extrinsic functional difference between the optimal classifiers for the two domains, which leads to a feature selection solution. Third, we distinguish the domain difference that comes from the old training domain from the difference that comes from the new test domain, and accordingly propose that domain adaptation should proceed in two stages.
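To make the instance weighting idea concrete, here is a minimal, hypothetical sketch (not the thesis's actual formulation): for discrete inputs, the weight of a source instance can be estimated as the ratio of its smoothed target-domain frequency to its source-domain frequency, so that training emphasizes instances that look like target-domain data. All names and the smoothing scheme below are illustrative assumptions.

```python
from collections import Counter

def density_ratio_weights(source_xs, target_xs, smoothing=1.0):
    """Estimate w(x) = p_target(x) / p_source(x) for discrete inputs
    using smoothed frequency counts (illustrative sketch only)."""
    src = Counter(source_xs)
    tgt = Counter(target_xs)
    vocab = set(src) | set(tgt)
    n_src, n_tgt, v = len(source_xs), len(target_xs), len(vocab)
    weights = {}
    for x in vocab:
        p_src = (src[x] + smoothing) / (n_src + smoothing * v)
        p_tgt = (tgt[x] + smoothing) / (n_tgt + smoothing * v)
        weights[x] = p_tgt / p_src
    return weights

# Source domain over-represents "finance"; target over-represents "sports".
source = ["finance"] * 8 + ["sports"] * 2
target = ["finance"] * 2 + ["sports"] * 8
w = density_ratio_weights(source, target)
# Instances typical of the target domain receive higher training weight.
assert w["sports"] > 1.0 > w["finance"]
```

In practice the density ratio would be estimated over feature vectors rather than single categorical values, but the reweighting principle is the same.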
The instance weighting and feature selection solutions are formally developed into two general and principled frameworks for domain adaptation. Both frameworks modify the objective function of the standard risk minimization framework for supervised learning, and both include standard supervised learning and semi-supervised learning as special cases. Evaluation on a number of natural language processing tasks with real data sets demonstrates that the domain adaptation techniques incorporated in the frameworks are more effective than standard supervised and semi-supervised learning.
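As a rough illustration of how a modified risk minimization objective might look (a sketch under assumed notation, not the thesis's exact framework): instance-weighted risk minimization multiplies each training example's loss by a per-instance weight, and setting all weights to one recovers standard supervised learning as a special case.

```python
import math

def train_weighted_logreg(data, weights, lr=0.5, epochs=200):
    """Minimize the instance-weighted logistic loss
        sum_i w_i * log(1 + exp(-y_i * (a*x_i + b)))
    by gradient descent on a 1-D model (a, b).
    Uniform weights reduce this to standard supervised learning."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        ga = gb = 0.0
        for (x, y), w in zip(data, weights):
            margin = y * (a * x + b)
            g = -w * y / (1.0 + math.exp(margin))  # d(loss)/d(a*x + b)
            ga += g * x
            gb += g
        a -= lr * ga / len(data)
        b -= lr * gb / len(data)
    return a, b

# Toy 1-D data: label +1 for positive x, -1 for negative x.
data = [(-2.0, -1), (-1.0, -1), (1.0, +1), (2.0, +1)]
uniform = [1.0] * len(data)  # the standard supervised special case
a, b = train_weighted_logreg(data, uniform)
assert a > 0  # learned decision direction matches the data
```

Replacing the uniform weights with density-ratio estimates shifts the objective toward instances that resemble the new domain, without changing the underlying learning algorithm.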
Observing that the effectiveness of different domain adaptation techniques varies from data set to data set, we also study different types of domain adaptation and their associations with different adaptation techniques. Using perturbed real data sets, we show that different types of domain difference indeed call for different adaptation techniques. This analysis deepens our understanding of domain adaptation and can help select the appropriate technique for a particular adaptation problem.
Although this thesis focuses on domain adaptation in natural language processing, most of the analysis and most of the proposed adaptation techniques are not restricted to natural language processing; they apply generally to classification tasks in which the training and test domains differ.