Title:Beyond CTT and IRT: using an interactional measurement model to investigate the decision making process of EPT essay raters
Author(s):Wang, Xin
Director of Research:Davidson, Frederick G.
Doctoral Committee Chair(s):Davidson, Frederick G.
Doctoral Committee Member(s):Chang, Hua-Hua; Christianson, Kiel; Sadler, Randall W.
Department / Program:Educational Psychology
Discipline:Educational Psychology
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):English as a second language (ESL) writing test
rater decision making
performance assessment
test validity
test reliability
Abstract:This doctoral dissertation investigates the gap between the nature of ESL performance tests and the score-based analysis tools used in the field of language testing. Its purpose is to propose a new testing model and a new experimental instrument for examining test validity and reliability through raters' decision-making processes in an ESL writing performance test. A writing test, as a language performance assessment, is a multifaceted entity involving the interaction of various stakeholders, among whom essay raters have a great impact on essay scores through their subjective scoring decisions, thereby influencing test validity and reliability (Huot, 1990; Lumley, 2002). This understanding creates a demand for methodological tools that quantify the rater decision-making process and the interaction between raters and other stakeholders in a language test. Previous studies within the frameworks of Classical Test Theory (CTT) and Item Response Theory (IRT) have focused mainly on the final rating outcome, retrospective survey data, and/or raters' think-aloud protocols. Due to the limitations of experimental tools, very few studies, if any, have directly examined the moment-to-moment process by which essay raters reach their scoring decisions. The present study proposes a behavioral model for writing performance tests that investigates raters' scoring behavior and reading comprehension in combination with the final essay score. Though the focus of this study is writing assessment, the methodology is applicable to performance-based testing in general. The present framework treats the process of a language test as the interaction among test developer, test taker, test rater, and other test stakeholders.
In the current study, which focuses on a writing performance test, the interaction between test developer and test taker is realized directly through the test prompt and indirectly through the test score; the interaction between test taker and test rater, in turn, is reflected in the written response. This model defines and explores rater reliability and test validity via the interaction between the text (essays written by test takers) and the essay rater. Instead of approaching the success of such an interaction indirectly through the final score, the new testing model directly measures and examines rater behaviors with regard to essay reading and score decision making. Reflecting the "interactional" nature of a performance test, this new model is named the Interactional Testing Model (ITM). To capture online evidence of rater decision making, a computer-based interface was designed for this study to automatically collect the time-by-location information of raters' reading patterns, their text comprehension, and other scoring events. Three groups of variables representing essay features and raters' dynamic scoring process were measured by the rating interface: 1) Reading pattern: raters' reading rate, their go-back rate within and across paragraphs, and the time-by-location information of their sentence selections. 2) Reading comprehension and scoring behaviors: the time-by-location information of raters' verbatim annotations and comments, essay score assignments, and their answers to survey questions. 3) Essay features.
The experimental essays were processed and analyzed with Python and SAS with regard to the following variables: a) word frequency, b) essay length, c) total number of subject-verb mismatches as an indicator of syntactic anomaly, d) total number of clauses and sentence length as indicators of syntactic complexity, e) total number and location of inconsistent anaphoric referents as an indicator of discourse incoherence, and f) density and word frequency of sentence connectors as indicators of discourse coherence. The relations between these variables and raters' decision making were investigated both qualitatively and quantitatively. Results are organized around the following themes: 1) Rater reliability: rater differences appeared not only in score assignment but also in text reading and scoring focus. Inter-rater reliability results coincided with findings from raters' reading times and reading patterns: raters with a high reading rate and a low reading-digression rate were less reliable. 2) Test validity: rater attention was distributed unevenly across an essay and concentrated on essay features associated with "Idea Development"; raters' sentence annotations and scoring comments also showed a common focus on this scoring dimension. 3) Rater decision making: most raters demonstrated a linear reading pattern during text reading and essay grading. A rater-text interaction was observed: raters' reading times and essay scores were strongly correlated with certain essay features. A difference between trained and untrained raters was also observed; untrained raters tended to over-emphasize the importance of "grammar and lexical choice". As a descriptive framework for the study of rating, the new measurement model has both practical and theoretical significance.
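Two of the listed essay features, essay length and word frequency (plus a crude sentence-length measure), can be sketched in a few lines of Python. This is purely an illustration under naive assumptions: the dissertation's actual Python/SAS pipeline is not reproduced here, and the regex-based tokenization below is deliberately simplistic.

```python
import re
from collections import Counter

def essay_features(text):
    # Naive sentence split on terminal punctuation; real pipelines
    # would use a proper tokenizer (this is an illustrative sketch).
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "essay_length": len(words),                           # b) essay length
        "mean_sentence_length": len(words) / len(sentences),  # part of d)
        "word_frequency": Counter(words),                     # a) word frequency
    }

sample = "The students writes quickly. However, the essay lacks coherence."
feats = essay_features(sample)
print(feats["essay_length"], round(feats["mean_sentence_length"], 1))  # 9 4.5
```

Features such as subject-verb mismatch (c) or anaphoric-referent consistency (e) would require syntactic and coreference analysis well beyond this sketch.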
On the practical side, this model may shed light on the following research domains: 1) Rating validity and rater reliability: in addition to examining raters' final score assignments, the ITM provides a quality-control tool for ensuring that a rater follows the rating rubric and assigns test scores consistently. 2) Electronic essay grading: results from this study may inform the design and validation of automated rating engines in writing assessment. On the theoretical side, as a supplement to CTT and IRT, this model may enable researchers to go beyond simple post hoc analysis of test scores and reach a deeper understanding of raters' decision-making processes in the context of a writing test.
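The time-by-location event stream collected by the rating interface described in the abstract could be modeled roughly as below. All field and function names here are hypothetical illustrations, not taken from the dissertation's actual instrument; the go-back rate is computed as the fraction of reading moves that return to an earlier sentence.

```python
from dataclasses import dataclass

# Hypothetical time-by-location event record (names are illustrative only).
@dataclass
class RatingEvent:
    rater_id: str
    timestamp_ms: int   # time since session start
    paragraph: int      # location: paragraph index in the essay
    sentence: int       # location: sentence index within the paragraph
    event: str          # e.g. "fixation", "annotation", "comment", "score"
    payload: str = ""   # annotation text, comment, or score value

# A toy event log for one rater.
log = [
    RatingEvent("R1", 0,    0, 0, "fixation"),
    RatingEvent("R1", 1800, 0, 1, "fixation"),
    RatingEvent("R1", 4200, 0, 0, "fixation"),  # a go-back within the paragraph
    RatingEvent("R1", 6000, 1, 0, "fixation"),
    RatingEvent("R1", 9000, 1, 0, "annotation", "subject-verb mismatch"),
]

def go_back_rate(events):
    """Fraction of consecutive reading moves that land on an earlier sentence."""
    fixes = [e for e in events if e.event == "fixation"]
    moves = list(zip(fixes, fixes[1:]))
    if not moves:
        return 0.0
    backs = sum(1 for a, b in moves
                if (b.paragraph, b.sentence) < (a.paragraph, a.sentence))
    return backs / len(moves)

print(round(go_back_rate(log), 3))  # 0.333 (one backward move out of three)
```

Aggregating such records per rater yields the reading-rate and go-back measures the abstract relates to rater reliability.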
Issue Date:2014-05-30
Rights Information:Copyright 2014 Xin Wang
Date Available in IDEALS:2014-05-30
Date Deposited:2014-05
