Every systematic review that involves human raters making categorical decisions requires a measure of agreement that goes beyond simple percentage overlap. Cohen's kappa corrects for the agreement you would expect by chance alone, producing a statistic that reflects only the genuine concordance between raters.
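
In formula terms, kappa compares observed agreement (p_o) with the agreement expected by chance (p_e). A minimal sketch of that correction, with purely hypothetical numbers:

```python
# Cohen's kappa from observed agreement (p_o) and chance-expected agreement (p_e)
def cohens_kappa(p_o: float, p_e: float) -> float:
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: raters agree on 85% of items, chance alone would give 50%
print(cohens_kappa(0.85, 0.50))  # 0.70 -- only the agreement beyond chance counts
```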

This guide covers the full practical workflow: when kappa is appropriate, how to use our free calculator, how to interpret results using the Landis and Koch benchmarks, and how to handle situations where kappa can mislead you.

When Cohen's Kappa Applies

Kappa is appropriate when you have exactly two raters assigning the same set of items to the same nominal categories. For three or more raters, use Fleiss' kappa or intraclass correlation. For continuous measurements, use the intraclass correlation coefficient (ICC) via our ICC Calculator.

Unweighted Versus Weighted Kappa

Unweighted kappa treats all disagreements as equivalent. Use it for binary screening decisions (include versus exclude).

Weighted kappa assigns differential penalties based on distance between categories. Use it for ordinal scales like risk-of-bias ratings (low, unclear, high).
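
If you want to sanity-check a weighted result in code, scikit-learn's cohen_kappa_score accepts linear or quadratic weights; the ratings below are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical risk-of-bias ratings from two reviewers (ordinal: low < unclear < high)
rater_1 = ["low", "low", "unclear", "high", "unclear", "low", "high"]
rater_2 = ["low", "unclear", "unclear", "high", "low", "low", "unclear"]

# Map the labels to ordered integers so the distance between categories is meaningful
order = {"low": 0, "unclear": 1, "high": 2}
r1 = [order[x] for x in rater_1]
r2 = [order[x] for x in rater_2]

print(cohen_kappa_score(r1, r2))                    # unweighted kappa
print(cohen_kappa_score(r1, r2, weights="linear"))  # linearly weighted kappa
```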

Using the Kappa Calculator

Navigate to the Kappa Calculator and enter your data as a contingency table or as item-by-item ratings. The output includes: observed agreement (p_o), expected agreement (p_e), Cohen's kappa, standard error and 95% confidence interval, Landis and Koch benchmark category, prevalence index, and bias index.
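
To make those outputs concrete, here is a minimal sketch of how each quantity can be computed from a hypothetical 2x2 screening table; the standard error uses a common large-sample approximation rather than the exact variance formula.

```python
import math

# Hypothetical 2x2 screening table
#                     Rater B: include   Rater B: exclude
# Rater A: include          a = 40            b = 9
# Rater A: exclude          c = 6             d = 45
a, b, c, d = 40, 9, 6, 45
n = a + b + c + d

p_o = (a + d) / n                                        # observed agreement
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance-expected agreement
kappa = (p_o - p_e) / (1 - p_e)

# Large-sample standard error (approximation) and 95% confidence interval
se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
ci_low, ci_high = kappa - 1.96 * se, kappa + 1.96 * se

prevalence_index = abs(a - d) / n   # how skewed the two categories are
bias_index = abs(b - c) / n         # how differently the raters use the categories

print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}, PI = {prevalence_index:.2f}, BI = {bias_index:.2f}")
```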

Interpreting Kappa: The Landis and Koch Scale

Kappa Value       Strength of Agreement
Below 0.00        Poor
0.00 to 0.20      Slight
0.21 to 0.40      Fair
0.41 to 0.60      Moderate
0.61 to 0.80      Substantial
0.81 to 1.00      Almost Perfect

For systematic review screening, kappa at or above 0.61 (substantial agreement) is generally the minimum acceptable level.
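
If you want to attach the benchmark label programmatically, for example in a screening log, a simple lookup that mirrors the table above is enough; the function name here is just illustrative.

```python
def landis_koch(kappa: float) -> str:
    # Thresholds follow the Landis and Koch benchmarks in the table above
    if kappa < 0.00:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost Perfect"

print(landis_koch(0.70))  # Substantial
```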

A kappa below 0.41 usually means that the inclusion criteria are ambiguous or that the raters have not calibrated consistently. Re-examine the criteria definitions and conduct a calibration exercise.

The Prevalence and Bias Problem

Prevalence effect: When inclusion rates are very low (common in screening), kappa can be substantially lower than the observed agreement suggests. Two raters with 98% raw agreement can have a kappa of only about 0.5. When the prevalence index is high, interpret kappa with caution and report raw agreement alongside it.
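
A small hypothetical table shows how this happens: with 97 joint exclusions, 1 joint inclusion, and 2 disagreements out of 100 records, raw agreement is 98% but kappa works out to roughly 0.49.

```python
# Hypothetical low-prevalence screening table (100 records)
a, b, c, d = 1, 1, 1, 97   # a = both include, d = both exclude, b and c = disagreements
n = a + b + c + d

p_o = (a + d) / n                                        # 0.98 raw agreement
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # 0.9608 -- chance agreement is huge
kappa = (p_o - p_e) / (1 - p_e)                          # about 0.49

print(f"raw agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```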

Bias effect: When raters use the categories at systematically different rates, expected agreement shifts and kappa can even be paradoxically inflated relative to the unbiased case. A high bias index signals a calibration problem regardless of the kappa value.

Practical Screening Workflow

  1. Pilot calibration: Both reviewers screen 50 to 100 random records. Compute kappa (a minimal code sketch follows this list), discuss disagreements, revise criteria if needed.
  2. Full independent screening.
  3. Kappa computation with prevalence and bias checks.
  4. Disagreement resolution through discussion or a third reviewer.
  5. Reporting in the methods section.
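
For the pilot step, one quick way to turn two reviewers' item-level decisions into a kappa value is scikit-learn's cohen_kappa_score; the decision lists below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pilot-screening decisions for the same records, in the same order
reviewer_1 = ["exclude", "exclude", "include", "exclude", "include", "exclude"]
reviewer_2 = ["exclude", "include", "include", "exclude", "include", "exclude"]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
raw_agreement = sum(x == y for x, y in zip(reviewer_1, reviewer_2)) / len(reviewer_1)

print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```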

For chi-square-based tests of association, see our Chi-Square Calculator.

Key Takeaways

Cohen's kappa corrects observed agreement for the agreement expected by chance, which makes it the standard statistic for two raters assigning nominal categories. For systematic review screening, aim for a kappa of at least 0.61, use weighted kappa for ordinal scales such as risk-of-bias ratings, and when prevalence is heavily skewed report raw agreement (and consider PABAK) alongside kappa.

FAQ

What kappa value do I need for systematic review screening?

Most guidelines treat a kappa at or above 0.61 as the minimum acceptable level. Cochrane reviews require that all disagreements be resolved regardless of the kappa value.

What is the difference between Cohen's kappa and percentage agreement?

Percentage agreement counts agreement that would have occurred by chance as if it were genuine. Cohen's kappa subtracts the expected chance agreement, so it measures only the agreement that exceeds chance.

When should I use weighted kappa?

Use weighted kappa when the rating scale is ordinal (like low, unclear, high risk of bias). For binary decisions, use unweighted kappa.

Can I calculate kappa for more than two raters?

Standard Cohen's kappa applies to exactly two raters. For three or more, use Fleiss' kappa. For continuous ratings, use our ICC Calculator.

My kappa is high but prevalence index is also high. What does this mean?

The vast majority of items fall into one category. Kappa is inherently unstable in this situation. Report raw observed agreement alongside kappa and consider prevalence-adjusted bias-adjusted kappa (PABAK).
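
For two categories, PABAK depends only on raw agreement, so it is easy to report alongside kappa; using the 98% agreement example above:

```python
p_o = 0.98              # observed raw agreement (two-category case)
pabak = 2 * p_o - 1     # prevalence-adjusted bias-adjusted kappa
print(pabak)            # 0.96
```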

Need help with your systematic review or meta-analysis? Get a free quote from our team of PhD researchers.