Interrater Reliability

  • Consensus in observations among different people (assumes the content and conditions of the observation are the same)
  • Degree of scoring stability a measure yields across multiple observers
  • Measure of agreement among raters

Rater Effects on Reliability

  • Halo Effect (an impression formed on one dimension may influence ratings on others)
  • Stereotyping (an impression of an entire group may influence the rating of an individual)
  • Perception Differences (rater viewpoints and past experiences)
  • Leniency/Stringency Error (lack of knowledge)
  • Scale Shrinking (not applicable here)

Calculating Interrater Reliability

  • R = number of agreements / (number of agreements + number of disagreements)
  • Calculate each individual rater's agreement with the "right answers"
  • Calculate agreement with the "right answer" for each item across all raters
  • Calculate variability (range, standard error)
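
The calculations listed above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the answer key, the rater names, and the scores are hypothetical, and taking the standard error over the per-rater R values is one reasonable reading of "variability", assumed here.

```python
from math import sqrt
from statistics import stdev


def agreement_ratio(ratings, key):
    """R = agreements / (agreements + disagreements) against an answer key."""
    agreements = sum(1 for r, k in zip(ratings, key) if r == k)
    disagreements = len(key) - agreements
    return agreements / (agreements + disagreements)


# Hypothetical "right answers" and rater scores, for illustration only.
answer_key = ["A", "B", "A", "C"]
rater_scores = {
    "rater_1": ["A", "B", "A", "C"],
    "rater_2": ["A", "B", "B", "C"],
    "rater_3": ["A", "C", "B", "C"],
}

# Agreement of each individual rater with the right answers.
per_rater = {name: agreement_ratio(s, answer_key) for name, s in rater_scores.items()}

# Agreement with the right answer for each item, across all raters.
per_item = [
    sum(1 for s in rater_scores.values() if s[i] == answer_key[i]) / len(rater_scores)
    for i in range(len(answer_key))
]

# Variability of the per-rater R values: range and standard error of the mean.
values = list(per_rater.values())
agreement_range = max(values) - min(values)
standard_error = stdev(values) / sqrt(len(values))

print(per_rater)
print(per_item)
print(agreement_range, standard_error)
```

Low per-item values point to items (or scoring-guide entries) that raters interpret differently; low per-rater values point to individual raters who may need further training.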

Minimizing Rater Effects and Improving Rater Reliability

  • Through Training
  • Through Training Exercise Analysis
  • Through Scoring Guide Revisions
  • Through Instrument Revisions

Rater Training

  • Familiarize raters with measures that they will be working with
  • Ensure that raters understand the sequence of operations they must perform
  • Build mutual understanding and clarify the rationales behind ratings
  • Do so by working toward joint agreements and documenting them

Stat Med 1990 Sep;9(9):1103-15

Measuring interrater reliability among multiple raters: an example of methods for nominal data.

Posner KL, Sampson PD, Caplan RA, Ward RJ, Cheney FW

Department of Anesthesiology, University of Washington, Seattle 98195.

This paper reviews and critiques various approaches to the measurement of reliability among multiple raters in the case of nominal data. We consider measurement of the overall reliability of a group of raters (using kappa-like statistics) as well as the reliability of individual raters with respect to a group. We introduce modifications of previously published estimators appropriate for measurement of reliability in the case of stratified sampling frames and we interpret these measures in view of standard errors computed using the jackknife. Analyses of a set of 48 anaesthesia case histories in which 42 anaesthesiologists independently rated the appropriateness of care on a nominal scale serve as an example.
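
To make "kappa-like statistics" and jackknife standard errors concrete, the sketch below computes Fleiss' kappa, a standard chance-corrected agreement measure for multiple raters and nominal categories, together with a leave-one-subject-out jackknife standard error. It is an illustration under stated assumptions, not the paper's method: the modified estimators for stratified sampling frames are not reproduced, the function names are hypothetical, and the small count table is invented for the example.

```python
from math import sqrt


def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned subject i to category j. Every row must sum to the same n."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall proportion of assignments falling in each category.
    p = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed pairwise agreement for each subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects   # mean observed agreement
    p_e = sum(x * x for x in p)     # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


def jackknife_se(counts):
    """Jackknife standard error of kappa, deleting one subject at a time
    (an assumed resampling unit for this sketch)."""
    n = len(counts)
    leave_one_out = [fleiss_kappa(counts[:i] + counts[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    return sqrt((n - 1) / n * sum((k - mean_loo) ** 2 for k in leave_one_out))


# Hypothetical data: 4 cases, 3 raters, 2 nominal categories.
example = [[2, 1], [1, 2], [3, 0], [0, 3]]
kappa = fleiss_kappa(example)
se = jackknife_se(example)
print(kappa, se)
```

Note that the statistic is undefined when every rating falls in a single category, because chance agreement is then 1 and the denominator is zero; the sketch does not guard against that case.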