Files in this item



application/pdf: MASSUNG-THESIS-2015.pdf (273kB), Restricted to U of Illinois


Title: Non-native text analysis with Syntactic Diff, a general comparative text mining framework
Author(s): Massung, Sean Alexander
Department / Program: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): text mining
natural language processing
comparative text mining
non-native text analysis
non-native text mining
second language education
non-native English speakers
Abstract: Non-native speakers of English far outnumber native speakers; English is the main language of books, newspapers, airports, air-traffic control, international business, academic conferences, science, technology, diplomacy, sports, international competitions, pop music, and advertising [1]. Online education in the form of MOOCs (massive open online courses) is also primarily in English, including courses that teach English itself. This creates enormous amounts of text written by non-native speakers, which in turn generates a need for grammar correction and analysis. Even aside from MOOCs, the number of English learners in Asia alone is in the tens of millions. In response to this powerful motivation, we describe SYNTACTIC DIFF, a novel edit-based method for transforming sequences of words given a reference corpus. These transformations can be used directly or employed as features to represent text data in a wide variety of text mining scenarios. As case studies, we apply SYNTACTIC DIFF to four quite different tasks in non-native text analysis and show its benefit in each case. In the first task, we use weighted word edits with likelihood scoring for grammatical error correction. We compare our method against systems from a grammar correction shared task and find that SYNTACTIC DIFF edits perform comparably while being much more general than the other methods. The second task is native language identification: a classification problem that predicts the native language of a student writer from English essays. We represent documents as vectors of edits and show that a combination of unigram words and SYNTACTIC DIFF edits outperforms each representation individually. The third task is fluency scoring, in which we test whether the manually categorized fluency levels of English students can be modeled by SYNTACTIC DIFF features.
In the fourth task, we create clusters of student essays with similar errors via topic modeling and find that their interpretability is significantly higher than that of an n-gram word approach. SYNTACTIC DIFF is highly customizable and able to capture syntactic differences from a reference corpus at the sentence, document, and subcorpus levels. This enables both a rich translation method and a feature representation for many text mining tasks that deal with word usage and syntax beyond bag-of-words. In particular, this thesis focuses on non-native text analysis applications, though SYNTACTIC DIFF is not at all limited to that domain.
Issue Date: 2015-04-15
Rights Information: Copyright 2015 Sean Massung
Date Available in IDEALS: 2015-07-22
Date Deposited: May 2015
