Every systematic review that involves human raters making categorical decisions requires a measure of agreement that goes beyond simple percentage overlap. Cohen's kappa corrects for the agreement you would expect by chance alone, producing a statistic that reflects only the genuine concordance between raters.
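
In formula terms, kappa compares observed agreement (p_o) with the agreement expected by chance (p_e). A minimal sketch of that correction, with purely hypothetical numbers:

```python
# Cohen's kappa from observed agreement (p_o) and chance-expected agreement (p_e)
def cohens_kappa(p_o: float, p_e: float) -> float:
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: raters agree on 85% of items, chance alone would give 50%
print(cohens_kappa(0.85, 0.50))  # 0.70 -- only the agreement beyond chance counts
```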

This guide covers the full practical workflow: when kappa is appropriate, how to use our free calculator, how to interpret results using the Landis and Koch benchmarks, and how to handle situations where kappa can mislead you.

When Cohen's Kappa Applies

Kappa is appropriate when you have exactly two raters assigning the same set of items to the same nominal categories. For three or more raters, use Fleiss' kappa or intraclass correlation. For continuous measurements, use the intraclass correlation coefficient (ICC) via our ICC Calculator.

Unweighted Versus Weighted Kappa

Unweighted kappa treats all disagreements as equivalent. Use it for binary screening decisions (include versus exclude).

Weighted kappa assigns differential penalties based on distance between categories. Use it for ordinal scales like risk-of-bias ratings (low, unclear, high).
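
If you want to sanity-check a weighted result in code, scikit-learn's cohen_kappa_score accepts linear or quadratic weights; the ratings below are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical risk-of-bias ratings from two reviewers (ordinal: low < unclear < high)
rater_1 = ["low", "low", "unclear", "high", "unclear", "low", "high"]
rater_2 = ["low", "unclear", "unclear", "high", "low", "low", "unclear"]

# Map the labels to ordered integers so the distance between categories is meaningful
order = {"low": 0, "unclear": 1, "high": 2}
r1 = [order[x] for x in rater_1]
r2 = [order[x] for x in rater_2]

print(cohen_kappa_score(r1, r2))                    # unweighted kappa
print(cohen_kappa_score(r1, r2, weights="linear"))  # linearly weighted kappa
```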

Using the Kappa Calculator

Navigate to the Kappa Calculator and enter your data as a contingency table or as item-by-item ratings. The output includes: observed agreement (p_o), expected agreement (p_e), Cohen's kappa, standard error and 95% confidence interval, Landis and Koch benchmark category, prevalence index, and bias index.
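
To make those outputs concrete, here is a minimal sketch of how each quantity can be computed from a hypothetical 2x2 screening table; the standard error uses a common large-sample approximation rather than the exact variance formula.

```python
import math

# Hypothetical 2x2 screening table
#                     Rater B: include   Rater B: exclude
# Rater A: include          a = 40            b = 9
# Rater A: exclude          c = 6             d = 45
a, b, c, d = 40, 9, 6, 45
n = a + b + c + d

p_o = (a + d) / n                                        # observed agreement
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance-expected agreement
kappa = (p_o - p_e) / (1 - p_e)

# Large-sample standard error (approximation) and 95% confidence interval
se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
ci_low, ci_high = kappa - 1.96 * se, kappa + 1.96 * se

prevalence_index = abs(a - d) / n   # how skewed the two categories are
bias_index = abs(b - c) / n         # how differently the raters use the categories

print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")
print(f"95% CI: {ci_low:.3f} to {ci_high:.3f}, PI = {prevalence_index:.2f}, BI = {bias_index:.2f}")
```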

Interpreting Kappa: The Landis and Koch Scale

Kappa Value       Strength of Agreement
Below 0.00        Poor
0.00 to 0.20      Slight
0.21 to 0.40      Fair
0.41 to 0.60      Moderate
0.61 to 0.80      Substantial
0.81 to 1.00      Almost Perfect

For systematic review screening, kappa at or above 0.61 (substantial agreement) is generally the minimum acceptable level.
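
If you want to attach the benchmark label programmatically, for example in a screening log, a simple lookup that mirrors the table above is enough; the function name here is just illustrative.

```python
def landis_koch(kappa: float) -> str:
    # Thresholds follow the Landis and Koch benchmarks in the table above
    if kappa < 0.00:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost Perfect"

print(landis_koch(0.70))  # Substantial
```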

A kappa below 0.41 usually means that the inclusion criteria are ambiguous or that the raters have not calibrated consistently. Re-examine the criteria definitions and conduct a calibration exercise.

The Prevalence and Bias Problem

Prevalence effect: When inclusion rates are very low (common in screening), kappa can be substantially lower than the observed agreement suggests. Two raters with 98% raw agreement can have a kappa of only about 0.5. When the prevalence index is high, interpret kappa with caution and report raw agreement alongside it.
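
A small hypothetical table shows how this happens: with 97 joint exclusions, 1 joint inclusion, and 2 disagreements out of 100 records, raw agreement is 98% but kappa works out to roughly 0.49.

```python
# Hypothetical low-prevalence screening table (100 records)
a, b, c, d = 1, 1, 1, 97   # a = both include, d = both exclude, b and c = disagreements
n = a + b + c + d

p_o = (a + d) / n                                        # 0.98 raw agreement
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # 0.9608 -- chance agreement is huge
kappa = (p_o - p_e) / (1 - p_e)                          # about 0.49

print(f"raw agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```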

Bias effect: When raters use the categories at systematically different rates, expected agreement shifts and kappa can even be paradoxically inflated relative to the unbiased case. A high bias index signals a calibration problem regardless of the kappa value.

Practical Screening Workflow

  1. Pilot calibration: Both reviewers screen 50 to 100 random records. Compute kappa (a minimal code sketch follows this list), discuss disagreements, revise criteria if needed.
  2. Full independent screening.
  3. Kappa computation with prevalence and bias checks.
  4. Disagreement resolution through discussion or a third reviewer.
  5. Reporting in the methods section.
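
For the pilot step, one quick way to turn two reviewers' item-level decisions into a kappa value is scikit-learn's cohen_kappa_score; the decision lists below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pilot-screening decisions for the same records, in the same order
reviewer_1 = ["exclude", "exclude", "include", "exclude", "include", "exclude"]
reviewer_2 = ["exclude", "include", "include", "exclude", "include", "exclude"]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
raw_agreement = sum(x == y for x, y in zip(reviewer_1, reviewer_2)) / len(reviewer_1)

print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```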

For chi-square-based tests of association, see our Chi-Square Calculator.

Key Takeaways

Cohen's kappa corrects observed agreement for the agreement expected by chance, which makes it the standard statistic for two raters assigning nominal categories. For systematic review screening, aim for a kappa of at least 0.61, use weighted kappa for ordinal scales such as risk-of-bias ratings, and when prevalence is heavily skewed report raw agreement (and consider PABAK) alongside kappa.

FAQ

What kappa value do I need for systematic review screening?

Most guidelines treat a kappa at or above 0.61 as the minimum acceptable level. Cochrane reviews require that all disagreements be resolved regardless of the kappa value.

What is the difference between Cohen's kappa and percentage agreement?

Percentage agreement counts agreement that would have occurred by chance as if it were genuine. Cohen's kappa subtracts the expected chance agreement, so it measures only the agreement that exceeds chance.

When should I use weighted kappa?

Use weighted kappa when the rating scale is ordinal (like low, unclear, high risk of bias). For binary decisions, use unweighted kappa.

Can I calculate kappa for more than two raters?

Standard Cohen's kappa applies to exactly two raters. For three or more, use Fleiss' kappa. For continuous ratings, use our ICC Calculator.

My kappa is high but prevalence index is also high. What does this mean?

The vast majority of items fall into one category. Kappa is inherently unstable in this situation. Report raw observed agreement alongside kappa and consider prevalence-adjusted bias-adjusted kappa (PABAK).
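
For two categories, PABAK depends only on raw agreement, so it is easy to report alongside kappa; using the 98% agreement example above:

```python
p_o = 0.98              # observed raw agreement (two-category case)
pabak = 2 * p_o - 1     # prevalence-adjusted bias-adjusted kappa
print(pabak)            # 0.96
```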

Need help with your systematic review or meta-analysis? Get a free quote from our team of PhD researchers.