Campus Access Dissertation
Cohen's κ (1960) is almost universally used to assess the strength of agreement among raters, each of whom classifies subjects as, e.g., diseased or not. Nelson and Edwards (2008) propose a generalized linear mixed model for the agreement process, showing that Cohen's κ can seriously underestimate agreement and proposing a model-based coefficient κM which does not suffer from this flaw (at least when the assumed model is correct). They discuss computationally intensive methods for estimation and inference on κM. This paper builds on that work, adding theoretical justification and practical methods for estimation and inference. The Nelson and Edwards model for the agreement process is motivated by a threshold model with latent crossed item-rater random effects plus interaction and pure error. Under this framework it is shown that κM is a monotone function of the Pearson (1900; 1913) tetrachoric correlation ρ between the latent effects of any two raters. A practical method for estimation and inference on κM which bypasses computationally intensive model fitting is defined and studied; this approach is analogous to that proposed by Pearson for estimation of ρ in a 2 × 2 table. Asymptotic estimator behavior is established using a generalization of Lehmann's theory of two-sample U-statistics (1951). Standard errors are defined using this asymptotic approximation, Owen's pigeonhole bootstrap (2007) and a counting statistics approach. Simulation studies suggest that the new estimation methods have negligible bias and, with the bootstrap approach, can perform well in terms of confidence interval coverage when the number of items/raters is on the order of 60 or more. Regarding robustness, the bootstrap approach can perform well with the probit link function when both random effects follow a t distribution with moderate-to-large degrees of freedom (df > 20). Under the logit link function, or under a probit link with low degrees of freedom (df = 3) for either of the random effects, actual coverage of nominal 90% confidence intervals for ρ can be unacceptably low.
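For readers unfamiliar with the two classical quantities named in the abstract, the following Python sketch computes Cohen's κ and Pearson's tetrachoric correlation ρ for a single pair of raters from a 2 × 2 table. It is an illustration only and is not the dissertation's estimator of κM, nor the KappaM function: the function names, the cell layout (a = both positive, b and c = the two kinds of disagreement, d = both negative), and the numerical choices (Simpson's rule, bisection) are all assumptions made here.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for two raters; a, d = agreements, b, c = disagreements."""
    n = a + b + c + d
    p_obs = (a + d) / n                      # observed agreement
    p1 = (a + b) / n                         # rater 1 positive rate
    p2 = (a + c) / n                         # rater 2 positive rate
    p_exp = p1 * p2 + (1 - p1) * (1 - p2)    # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)


def _bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = 1.0 - r * r
    return math.exp(-(h * h - 2 * r * h * k + k * k) / (2 * q)) / (
        2 * math.pi * math.sqrt(q)
    )


def _bvn_cdf(h, k, rho, steps=400):
    """P(Z1 <= h, Z2 <= k), using the identity
    Phi2(h, k; rho) = Phi(h) * Phi(k) + integral_0^rho phi2(h, k; r) dr,
    with the integral approximated by Simpson's rule (steps must be even)."""
    step = rho / steps
    total = _bvn_density(h, k, 0.0) + _bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * _bvn_density(h, k, i * step)
    return _STD.cdf(h) * _STD.cdf(k) + total * step / 3


def tetrachoric(a, b, c, d, tol=1e-10):
    """Tetrachoric correlation: the rho at which a latent bivariate normal,
    cut at thresholds matching both margins, reproduces the cell proportion a/n.
    Assumes both marginal positive rates lie strictly between 0 and 1."""
    n = a + b + c + d
    h = _STD.inv_cdf((a + b) / n)
    k = _STD.inv_cdf((a + c) / n)
    target = a / n
    lo, hi = -0.999, 0.999                   # search interval for rho
    while hi - lo > tol:                     # bisection; the CDF increases in rho
        mid = (lo + hi) / 2
        if _bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    table = (40, 10, 10, 40)                 # hypothetical counts
    print(cohen_kappa(*table), tetrachoric(*table))
```

For the hypothetical table (40, 10, 10, 40), both thresholds are zero, so ρ has the closed form sin(0.3π) ≈ 0.809, while κ is approximately 0.6. The two coefficients are on different scales and are not directly interchangeable.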
A function KappaM written in R (R Development Core Team, 2005) is provided which creates agreement image plots and calculates parameter estimates with bootstrap standard errors. The methods are illustrated with both a prostate biopsy example (Allsbrook et al., 2001) and a mammography example (Beam et al., 1996, 2003); the latter example shows substantial differences between κM and κ under high prevalence.
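The pigeonhole bootstrap used for the standard errors can be sketched in a few lines. The Python example below is a minimal illustration of the resampling scheme for a crossed items × raters matrix, in which items (rows) and raters (columns) are resampled independently with replacement. It is not the code of the R function KappaM: the function names, the arguments, and the toy statistic are hypothetical and introduced only for this sketch.

```python
import random
import statistics


def pigeonhole_bootstrap_se(ratings, statistic, n_boot=500, seed=0):
    """Bootstrap standard error of statistic(ratings) for a crossed
    items x raters matrix, given as a list of equal-length rows.

    Each replicate resamples row indices and column indices independently
    with replacement and keeps the cells at their intersections."""
    rng = random.Random(seed)                # seeded for reproducibility
    n_items = len(ratings)
    n_raters = len(ratings[0])
    replicates = []
    for _ in range(n_boot):
        rows = [rng.randrange(n_items) for _ in range(n_items)]
        cols = [rng.randrange(n_raters) for _ in range(n_raters)]
        resampled = [[ratings[i][j] for j in cols] for i in rows]
        replicates.append(statistic(resampled))
    return statistics.stdev(replicates)      # spread of the replicates


def positive_rate(matrix):
    """Toy statistic (an assumption for this sketch): share of positive ratings."""
    cells = [x for row in matrix for x in row]
    return sum(cells) / len(cells)


if __name__ == "__main__":
    # Hypothetical 0/1 ratings: 8 items classified by 6 raters.
    toy = [[(i + j) % 2 for j in range(6)] for i in range(8)]
    print(pigeonhole_bootstrap_se(toy, positive_rate))
```

Any agreement statistic that accepts an items × raters matrix could be passed in place of the toy statistic; the resampling loop itself does not change.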
Gao, J. (2012). Model-Based Measures of Interrater Agreement. (Doctoral dissertation). Retrieved from https://scholarcommons.sc.edu/etd/2605