Title: Domain Adaptation in Natural Language Processing
Author(s): Jiang, Jing
Subject(s): natural language processing
Abstract: With the rapid growth in the amount of digitized text in recent years, text information management has become increasingly important in people's daily lives. Natural language processing provides the foundation of many modern text information management technologies. For many natural language processing tasks, the state-of-the-art solutions are based on supervised statistical machine learning methods, which require large manually annotated corpora. However, the variations of text in vocabulary, format, style, etc. across domains, together with the large amount of human effort needed to create labeled training data, make it practically infeasible to directly apply supervised machine learning methods to natural language processing tasks in new domains. There is therefore a great need to develop special learning algorithms and techniques to adapt classifiers trained on some old domains to a different but related new domain. This thesis aims to understand the domain adaptation problem and to develop general learning techniques for solving it. To understand domain adaptation, a formal analysis is conducted from different perspectives. First, we look at the intrinsic distributional difference between two domains, which leads to an instance weighting solution to domain adaptation. Second, we look at the extrinsic functional difference between the optimal classifiers for two domains, which leads to a feature selection solution to domain adaptation. Third, we distinguish the domain difference that comes from the old training domain from the difference that comes from the new test domain, and accordingly propose that domain adaptation should consist of two stages. The instance weighting and feature selection solutions are formally developed into two general and principled frameworks for domain adaptation.
Both frameworks modify the objective function of the standard risk minimization framework for supervised learning, and include standard supervised learning and semi-supervised learning as special cases. Evaluation of the two frameworks on a number of natural language processing tasks using real data sets demonstrates the effectiveness of the domain adaptation techniques incorporated in the frameworks compared with standard supervised and semi-supervised learning. Observing that the effectiveness of different domain adaptation techniques varies from data set to data set, we also study different types of domain adaptation and their associations with different domain adaptation techniques. Using perturbed real data sets, we are able to show that different types of domain difference indeed require different domain adaptation techniques. This analysis deepens our understanding of domain adaptation, and potentially helps us select the appropriate techniques for particular domain adaptation problems. Although we focus on domain adaptation in natural language processing in this thesis, most of the analysis of the problem and most of the proposed domain adaptation techniques are not restricted to natural language processing problems but can be applied generally to most classification tasks in which the training and test domains differ.
Issue Date: 2008-07
Genre: Technical Report
Other Identifier(s): UIUCDCS-R-2008-2974
Rights Information: You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS: 2009-04-23
