Files in this item



application/pdf: Chih-Kai_Lin.pdf (722kB)


Title: Issues and challenges in current generalizability theory applications in rated measurement
Author(s): Lin, Chih-Kai
Director of Research: Zhang, Jinming; Davidson, Frederick G.
Doctoral Committee Chair(s): Zhang, Jinming
Doctoral Committee Member(s): Davidson, Frederick G.; Ryan, Katherine E.; Anderson, Carolyn J.
Department / Program: Educational Psychology
Discipline: Educational Psychology
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): generalizability theory (G theory)
Monte Carlo simulation
variance component estimates
sparse data
performance-based assessment
standards-to-standards correspondence
English language learners (ELLs)
Abstract: This dissertation examines issues and challenges in applying generalizability theory, or G theory (Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991), to measurement based on human ratings. Contexts in which such measurement prevails include, but are not limited to, performance-based assessments, standard settings, and content validation studies. Inherent in expert-rated measurement are systematic and random variations that can contribute to measurement error and thereby affect measurement reliability. Examples of systematic variability (i.e., facets in G-theory terminology) are differences in rater severity/leniency, variations in rater interpretations of scoring criteria, and interactions of these facets with the objects of measurement (i.e., the subjects on which the intended construct is measured), whereas random variability reflects unexpected fluctuations in the rating process. Because the utility of any rated measurement is contingent upon its reliability, analytical tools are needed to disentangle variability in the objects of measurement from variability associated with measurement facets and with random error. To this end, G theory provides a powerful analytical framework that allows investigators to tease out true differences among the objects of measurement and to assess the relative magnitude of construct-irrelevant variability. This dissertation follows a multi-paper approach and comprises six chapters: an introduction, four individual papers pertaining to theoretical and applied investigations of G theory in rated measurement, and a conclusion. The introduction (Chapter 1) sketches an overarching theme that unifies the separate papers and provides a brief summary of each paper.
The first paper (Chapter 2) compares two analytical methods under the G-theory framework, both designed to analyze the sparse rated data commonly observed in performance-based assessments. The rater method identifies blocks of fully crossed sub-datasets and then estimates variance components based on a weighted average across these sub-datasets, while the rating method forces a sparse dataset to be a fully crossed one by conceptualizing ratings as a random facet and then estimates variance components by the usual crossed-design procedures. This paper compares the estimation precision of the two methods via a Monte Carlo simulation study and an empirical study. Results show that when all raters are expected to be homogeneous in their score variability, either method yields good estimates of variance components. However, when some raters exhibit more variability in their ratings than others, the rater method yields more precise estimates than the rating method. The second paper (Chapter 3) is carried out in the context of examining correspondence between English language proficiency (ELP) standards and academic content standards in the US K-12 setting. Such correspondence studies provide information about the extent to which English language learners are expected to encounter academic language use closely associated with academic disciplines, such as mathematics. This paper describes one approach to conducting ELP standards-to-standards correspondence research based on reviewer judgments, and it also touches on reviewer consistency in judging the cognitive complexity of the target standards. Results suggest a relationship between the consistency of reviewer judgments and the level of specificity in the target standards.
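As an illustration of the rater method described above, the following minimal sketch estimates variance components for each fully crossed persons-by-raters block using the standard ANOVA (expected mean squares) estimators and then averages them across blocks. The function names are hypothetical, and the weighting by number of observations per block is an assumption; the abstract does not state which weights the dissertation uses.

```python
import numpy as np

def crossed_variance_components(scores):
    """ANOVA-based estimates for one fully crossed persons x raters block.

    Rows are persons (objects of measurement), columns are raters.
    Estimates can come out negative in small samples; they are returned as-is.
    """
    x = np.asarray(scores, dtype=float)
    n_p, n_r = x.shape
    grand = x.mean()
    # Sums of squares for persons, raters, and the residual (pr,e) term.
    ss_p = n_r * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_r = n_p * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_pr = np.sum((x - grand) ** 2) - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    # Solve the expected-mean-square equations for the variance components.
    return {
        "p": float((ms_p - ms_pr) / n_r),
        "r": float((ms_r - ms_pr) / n_p),
        "pr,e": float(ms_pr),
    }

def blockwise_variance_components(blocks):
    """Weighted average of per-block estimates.

    ASSUMPTION: weights are proportional to the number of ratings in each
    block; this is an illustrative choice, not the dissertation's scheme.
    """
    estimates = [crossed_variance_components(b) for b in blocks]
    weights = np.array([np.asarray(b).size for b in blocks], dtype=float)
    weights /= weights.sum()
    return {
        key: float(sum(w * est[key] for w, est in zip(weights, estimates)))
        for key in estimates[0]
    }
```

Each block must contain at least two persons and two raters, since the mean squares divide by the corresponding degrees of freedom.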
As an extension of the second paper, the third paper (Chapter 4) seeks to advance new applications of G theory in correspondence research and to examine reviewer reliability in relation to the number of raters. Ratings of the cognitive complexity germane to language performance indicators were collected from 28 correspondence studies with over 700 trained reviewers, consisting of content-area experts and English as a second language (ESL) specialists. Under the G-theory framework, reviewer reliability and standard errors of measurement in their ratings are evaluated as a function of the number of reviewers. Results show that depending on the particular grades and subject areas, 3-6 reviewers are needed to achieve an acceptable level of reliability and to keep measurement error in their ratings at a reasonable level. The fourth paper (Chapter 5) attempts to advance the discussion of nonadditivity in the context of G-theory applications in rated measurement. Nonadditivity occurs when some or all of the main and interaction effects, pertaining to the objects of measurement and measurement facet(s), are significantly correlated. As such, the paper analytically and empirically illustrates the distinction between additive and nonadditive one-facet G-theory models. In addition, the paper explores existing statistical procedures for detecting nonadditivity in data. Tukey's single-degree-of-freedom test for nonadditivity is evaluated in terms of Type I error and statistical power. Results show that the test is satisfactory in controlling for occurrences of erroneously identifying nonadditivity (Type I error) and that the test is successful in identifying one type of nonadditive interaction (power).
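The evaluation of reliability against the number of reviewers can be sketched with the standard decision-study formulas for a one-facet crossed design. This is a minimal illustration, not the dissertation's analysis: the function name is hypothetical, and the variance components in the loop are made-up values chosen only to show the calculation.

```python
import math

def d_study(var_p, var_r, var_pr_e, n_raters):
    """Decision-study summary for a one-facet crossed (p x r) design.

    var_p    : variance of the objects of measurement
    var_r    : rater (reviewer) main-effect variance
    var_pr_e : interaction variance confounded with random error
    n_raters : number of raters whose ratings are averaged
    """
    rel_error = var_pr_e / n_raters            # relative error variance
    abs_error = (var_r + var_pr_e) / n_raters  # absolute error variance
    return {
        "g_coef": var_p / (var_p + rel_error),  # generalizability coefficient
        "phi": var_p / (var_p + abs_error),     # index of dependability
        "rel_sem": math.sqrt(rel_error),
        "abs_sem": math.sqrt(abs_error),
    }

# ASSUMPTION: illustrative variance components, not taken from the dissertation.
for n in range(1, 7):
    result = d_study(var_p=0.50, var_r=0.10, var_pr_e=0.40, n_raters=n)
    print(n, round(result["g_coef"], 3), round(result["phi"], 3))
```

Both coefficients rise and both standard errors shrink as more raters are averaged, which is why a target reliability translates into a minimum number of reviewers.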
As will become clear in the dissertation, the first and fourth papers are motivated by methodological challenges in advancing G-theory applications in the field of educational measurement, while the second and third papers are motivated by validity issues in assessing the content knowledge of young English language learners in the field of language testing. Finally, the conclusion (Chapter 6) discusses some unsolved issues in G-theory applications and ideas for future research. First, issues regarding the use of many-facet Rasch measurement to complement G-theory analysis are discussed. Second, given that a performance test usually involves examinee responses being rated on a discrete ordinal scale, how to account for the discrete ordinal nature of measurement variables under the G-theory framework remains an unsolved area of research. Third, nonadditivity in multi-faceted G-theory models also deserves more research effort because most performance tests entail more than one measurement facet, such as those associated with raters and tasks.
Issue Date: 2014-09-16
Rights Information: Copyright 2014 Chih-Kai Lin
Date Available in IDEALS: 2014-09-16
Date Deposited: 2014-08
