Inter-Rater Agreement in Laryngology: Are We Measuring It Right?
Objective: Kappa statistics are the most appropriate measures of agreement or reliability; however, inadequate statistics, such as correlation coefficients or unspecified consensus between raters, are often used to report agreement. In 2005, Rosen et al. questioned the reliability of video-laryngeal stroboscopy (VLS) as a research tool because of low inter-rater agreement, and a 2015 systematic review found that only 2 out of 80 studies reported "almost perfect" agreement and only 3 out of 80 used Kappa statistics. It is currently unclear whether the field has improved. The objective of this work is to review the literature from the past 10 years reporting inter-rater agreement in VLS assessments and to determine whether appropriate statistics are being used and whether higher inter-rater agreement is being obtained.
Methods/Design: A systematic review was conducted using PubMed, Embase, Cochrane, and SciELO. Six hundred fifty-four studies were retrieved, and after full-text review, 62 studies were included.
Results: 52% of the studies had 2 raters and 36% had 3. 37% of the studies used consensus as the form of agreement between raters, yet no study described how consensus was reached. 11% used Cohen's Kappa only, and 7% used Fleiss' Kappa; among those, only 4 studies reached "almost perfect" agreement in at least one of the VLS measures. 5 studies used more than one method of agreement between raters. When a formal statistical method was applied, only 7% of the studies reported the cut-off values used to interpret agreement. 18% of the studies used some type of correlation coefficient without describing whether the data were normally distributed.
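For reference, and as a standard definition rather than a result of this review, Cohen's Kappa corrects observed agreement for the agreement expected by chance, which is precisely what percent agreement and correlation coefficients fail to do:

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

where \(p_o\) is the observed proportion of agreement between the two raters and \(p_e\) is the proportion of agreement expected by chance, derived from each rater's marginal rating frequencies.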
Conclusions: The lack of description of consensus methodology creates uncertainty about whether individual opinion or level of expertise influenced the results. The absence of cut-off values for interpreting Kappa statistics leaves readers unsure how strong the reported inter-rater agreement truly was and whether the results can be trusted. Rosen's proposed benchmark (κ = 0.81) may be difficult to achieve in laryngology studies, which often have limited variability across cases. Most VLS studies reported persistently low inter-rater agreement despite ongoing standardization efforts. There is a need to re-examine how reliability is defined, measured, and applied in VLS research.
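As an illustrative sketch only (the ratings below are hypothetical, not data from the reviewed studies, and the example assumes Python with scikit-learn and statsmodels), the following shows how Cohen's and Fleiss' Kappa can be computed and reported alongside explicit interpretation cut-offs, here the Landis & Koch bands, in which 0.81-1.00 corresponds to "almost perfect" agreement:

```python
# Illustrative sketch only: hypothetical ratings, not data from the review.
# Cohen's Kappa for two raters (scikit-learn) and Fleiss' Kappa for three
# or more raters (statsmodels), interpreted against the Landis & Koch bands.
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ordinal VLS ratings (e.g., 0 = absent, 1 = mild, 2 = severe).
rater_a = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
rater_b = [0, 1, 2, 2, 0, 2, 1, 0, 0, 2]
rater_c = [0, 1, 1, 1, 0, 2, 1, 1, 0, 2]

# Two raters: Cohen's Kappa (chance-corrected, unlike a correlation coefficient).
kappa_2 = cohen_kappa_score(rater_a, rater_b)

# Three raters: Fleiss' Kappa on a subjects-by-categories count table.
ratings = list(zip(rater_a, rater_b, rater_c))  # one row per case
table, _ = aggregate_raters(ratings)            # counts per rating category
kappa_3 = fleiss_kappa(table, method="fleiss")

def landis_koch(k: float) -> str:
    """Map a kappa value to the Landis & Koch (1977) descriptive band."""
    if k < 0.00:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"

print(f"Cohen's Kappa (2 raters):  {kappa_2:.2f} ({landis_koch(kappa_2)})")
print(f"Fleiss' Kappa (3 raters):  {kappa_3:.2f} ({landis_koch(kappa_3)})")
```

Reporting the kappa value together with the cut-off scale used to interpret it, as in this sketch, is the kind of explicit reporting that only 7% of the reviewed studies provided.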