Calculate inter-rater agreement for screening decisions in your systematic review. Enter a 2×2 agreement matrix or a multi-category matrix to get Cohen's kappa, standard error, 95% CI, and interpretation based on the Landis & Koch scale.
Enter the number of items classified by each rater into the 2×2 agreement matrix. Pre-filled with a screening example (N = 100).
κ = (Po − Pe) / (1 − Pe), where Po = (a + d) / N and Pe = ((a+b)(a+c) + (c+d)(b+d)) / N². Here a and d are the agreement cells (e.g., both include / both exclude) and b, c are the discrepant cells.
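For readers who want to check the arithmetic outside the calculator, here is a minimal Python sketch of these formulas. The function name is illustrative, and the SE uses the common large-sample approximation √(Po(1 − Po) / (N(1 − Pe)²)); the calculator may use a different SE variant.

```python
import math

def kappa_2x2(a: int, b: int, c: int, d: int) -> dict:
    """Cohen's kappa for a 2x2 table: a, d = agreement cells, b, c = discrepancies."""
    n = a + b + c + d
    po = (a + d) / n                                       # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))    # approximate large-sample SE
    return {
        "kappa": kappa,
        "se": se,
        "ci95": (kappa - 1.96 * se, kappa + 1.96 * se),
        "po": po,
        "pe": pe,
    }

# Example: 40 dual includes, 5 + 5 discrepancies, 50 dual excludes (N = 100)
print(kappa_2x2(40, 5, 5, 50))  # kappa ~0.80, 95% CI ~(0.68, 0.92)
```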
Select 2×2 mode for binary screening or multi-category mode for 3+ classification categories.
Fill in the agreement matrix cells with the number of items assigned to each combination of the two raters' categories (a code sketch of the multi-category computation follows these steps).
See kappa, SE, 95% CI, observed and expected agreement, and the Landis & Koch interpretation.
Copy all results to clipboard for your methods section, or reset to start a new calculation.
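As referenced in the steps above, here is a sketch of the multi-category computation, assuming a square confusion matrix with rows as rater 1's categories and columns as rater 2's; the helper name is illustrative.

```python
def kappa_multi(matrix: list[list[int]]) -> float:
    """Cohen's kappa for a k x k confusion matrix of two raters' counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(k)) / n                   # diagonal = agreements
    row = [sum(matrix[i]) for i in range(k)]                       # rater 1 marginals
    col = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals
    pe = sum(row[i] * col[i] for i in range(k)) / n ** 2
    return (po - pe) / (1 - pe)

# Example: include / exclude / uncertain screening decisions (N = 100)
decisions = [
    [30, 4, 2],   # rater 1: include
    [3, 45, 5],   # rater 1: exclude
    [2, 3, 6],    # rater 1: uncertain
]
print(kappa_multi(decisions))  # ~0.67 (substantial)
```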
Percent agreement (Po) can be misleadingly high when prevalence is extreme. Kappa corrects for chance agreement and provides a more meaningful measure. Always report both Po and kappa in your methods section.
When one category dominates (e.g., 95% excluded), Pe is high and kappa can be low despite high percent agreement. PABAK (2 × Po − 1) adjusts for this effect and is recommended as a supplement.
The Landis & Koch scale is widely used but arbitrary. In clinical screening contexts, kappa ≥ 0.60 is often the minimum acceptable threshold. Context and consequences of disagreement should inform your threshold.
When categories have an inherent order (e.g., Low/Medium/High risk of bias), weighted kappa assigns partial credit for near-agreement. Linear or quadratic weights are the most common choices.
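A sketch of weighted kappa under linear or quadratic weights, assuming categories are listed in their natural order; the function and parameter names are illustrative.

```python
def weighted_kappa(matrix: list[list[int]], weights: str = "linear") -> float:
    """Weighted kappa for ordered categories indexed 0..k-1."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row = [sum(matrix[i]) for i in range(k)]
    col = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    def w(i: int, j: int) -> float:
        d = abs(i - j) / (k - 1)                  # normalized distance between categories
        return 1 - (d * d if weights == "quadratic" else d)

    po = sum(w(i, j) * matrix[i][j] for i in range(k) for j in range(k)) / n
    pe = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k)) / n ** 2
    return (po - pe) / (1 - pe)

# Example: Low / Medium / High risk-of-bias judgments
rob = [[20, 5, 0], [4, 15, 6], [1, 3, 16]]
print(weighted_kappa(rob, "quadratic"))
```

Quadratic weights penalize a Low/High disagreement four times as heavily as a Low/Medium one, which is why they are a common default for ordinal scales.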
A Cohen's kappa calculator measures the degree of agreement between two raters on categorical classifications, correcting for the proportion of agreement that would occur by chance alone. Cohen (1960) introduced kappa as a reliability coefficient for nominal scales, and it has since become the standard metric for reporting inter-rater agreement in systematic reviews. The Cochrane Handbook for Systematic Reviews of Interventions (Higgins et al., 2023, Chapter 4) recommends that reviewers report agreement statistics for both the screening phase and risk of bias assessment, because independent dual review is a cornerstone of systematic review methodology. When prevalence is extreme and the kappa paradox deflates the statistic, Gwet's AC1 provides a prevalence-robust alternative that yields more stable estimates of agreement. For reviews involving three or more screeners, Fleiss' kappa extends the two-rater framework to multiple raters, while Krippendorff's alpha generalizes further by accommodating any number of raters, multiple category types, and incomplete (missing) data.
This kappa statistic tool computes observed agreement (Po), expected agreement by chance (Pe), and kappa as (Po − Pe) / (1 − Pe). Interpretation follows the Landis and Koch (1977) scale: values below 0 indicate poor agreement, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect. However, kappa is sensitive to prevalence and bias: when one category dominates (e.g., 95% of screened records are excluded), kappa can be low despite high observed agreement, a phenomenon known as the kappa paradox (Feinstein & Cicchetti, 1990). For this reason, systematic review reporting should include both kappa and the observed agreement percentage alongside the prevalence index.
In the systematic review workflow, the screening agreement calculator is applied at multiple stages. During title-and-abstract screening, two reviewers independently classify records as "include," "exclude," or "uncertain," producing a multi-category agreement matrix that this tool handles natively. During full-text eligibility assessment, the same kappa computation quantifies agreement on the final inclusion decision. PRISMA 2020 (Page et al., 2021) calls for transparent reporting of the selection process, which in practice means reporting the kappa value, the number of discrepancies, and how disagreements were resolved (discussion, third-reviewer adjudication, or consensus). Screening platforms such as Covidence and Rayyan include built-in agreement reporting for dual-independent screening. Conducting calibration exercises and pilot screening on a subset of records before full screening begins is strongly recommended: it surfaces ambiguous criteria and improves agreement before disagreements accumulate at scale.
Beyond screening, kappa is essential for assessing agreement on risk of bias judgments. When two reviewers independently evaluate each study using our RoB 2 assessment tool or our ROBINS-I tool for non-randomized studies, the domain-level judgments (low risk, some concerns, high risk) form an ordinal agreement matrix that kappa can quantify. For continuous-scale reliability — such as agreement on quality scores or extraction of numerical data — the intraclass correlation coefficient (ICC) calculator is the appropriate complement. Together, kappa and ICC provide a complete picture of inter-rater reliability across the categorical and continuous components of a systematic review. When building your extraction form, our data extraction template builder includes built-in fields for recording agreement statistics.
Cohen’s kappa (κ) measures the agreement between two raters on categorical classifications, correcting for agreement that would occur by chance alone. Unlike percent agreement (Po), kappa accounts for the expected agreement (Pe) under the assumption of independence. A kappa of 1 means perfect agreement, 0 means agreement equal to chance, and negative values mean agreement worse than chance.
Landis and Koch (1977) proposed a widely used benchmark: κ < 0 = Poor, 0–0.20 = Slight, 0.21–0.40 = Fair, 0.41–0.60 = Moderate, 0.61–0.80 = Substantial, 0.81–1.00 = Almost Perfect. While convenient, these cutoffs are arbitrary and context-dependent. In systematic review screening, κ ≥ 0.60 is generally considered acceptable.
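If you script your reporting, the scale above maps directly to code; a minimal sketch (the function name is illustrative):

```python
def landis_koch(kappa: float) -> str:
    """Map a kappa value to the Landis & Koch (1977) label."""
    if kappa < 0:
        return "Poor"
    for cutoff, label in [(0.20, "Slight"), (0.40, "Fair"),
                          (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= cutoff:
            return label
    return "Almost Perfect"

print(landis_koch(0.67))  # Substantial
```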
The prevalence paradox occurs when the prevalence of the target category is very high or very low, producing high percent agreement but a paradoxically low kappa. This happens because Pe becomes large when the marginal distributions are skewed. PABAK (prevalence-adjusted bias-adjusted kappa) corrects for this by assuming balanced marginals, i.e., 50% prevalence and no rater bias: PABAK = 2 × Po − 1.
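A small numeric illustration of the paradox, applying the same 2×2 formulas to a heavily skewed table (the counts are made up for illustration):

```python
# 2 dual includes, 2 + 2 discrepancies, 94 dual excludes (N = 100)
a, b, c, d = 2, 2, 2, 94
n = a + b + c + d
po = (a + d) / n                                       # 0.96 percent agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # ~0.92 under skewed marginals
kappa = (po - pe) / (1 - pe)                           # ~0.48 despite Po = 0.96
pabak = 2 * po - 1                                     # 0.92, tracks raw agreement
print(round(po, 2), round(kappa, 2), round(pabak, 2))
```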
Gwet’s AC1 is robust to the prevalence paradox and is recommended when prevalence is extreme. Scott’s pi uses a different chance-agreement model, based on the raters' averaged marginal proportions. Krippendorff’s alpha generalizes to multiple raters, multiple categories, and missing data. For ordinal categories, weighted kappa (linear or quadratic weights) accounts for the magnitude of disagreement.
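A sketch of Gwet's AC1 under the same square-matrix convention as the earlier examples. The chance term, Pe = (1/(k − 1)) Σ πk(1 − πk) with πk the mean of the two raters' marginal proportions for category k, follows Gwet's definition; the helper name is illustrative.

```python
def gwet_ac1(matrix: list[list[int]]) -> float:
    """Gwet's AC1 for a k x k confusion matrix of two raters' counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(k)) / n
    # Mean marginal proportion per category across both raters
    pi = [(sum(matrix[i]) + sum(matrix[j][i] for j in range(k))) / (2 * n)
          for i in range(k)]
    pe = sum(p * (1 - p) for p in pi) / (k - 1)
    return (po - pe) / (1 - pe)

# Same skewed table as the paradox example: kappa ~0.48, but AC1 stays high
print(gwet_ac1([[2, 2], [2, 94]]))  # ~0.96
```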
The Cochrane Handbook and PRISMA guidelines recommend reporting inter-rater agreement for both title/abstract screening and full-text screening. You should report kappa (or an alternative) along with percent agreement. If kappa is low, describe the resolution process (e.g., discussion, third reviewer). This helps readers assess the reliability of study selection.
For systematic review screening, κ ≥ 0.60 (substantial agreement) is generally considered the minimum acceptable threshold, though κ ≥ 0.80 (almost perfect) is preferred. If kappa falls below 0.60 after pilot screening, reviewers should recalibrate by discussing discrepancies, clarifying eligibility criteria, and re-screening a sample before proceeding with full screening.
This is the kappa paradox (prevalence paradox). When one category dominates (for example, 95% of records are excluded), the expected agreement by chance (Pe) is very high, leaving little room for kappa to exceed zero. PABAK (prevalence-adjusted bias-adjusted kappa) corrects for this effect: PABAK = 2 × Po − 1. Report both kappa and percent agreement for transparency.
Cohen’s kappa measures agreement between exactly two raters on categorical data. Fleiss’ kappa extends this to three or more raters, where each item is rated by a fixed number of randomly assigned raters. For systematic reviews with more than two screeners, Fleiss’ kappa is the appropriate statistic. Both correct for chance agreement.
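A minimal sketch of Fleiss' kappa, assuming ratings[i][j] counts how many of the m raters placed item i in category j, with every item rated by the same number of raters; the names are illustrative.

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for n items x k categories, m raters per item."""
    n = len(ratings)          # number of items
    m = sum(ratings[0])       # raters per item (constant across items)
    k = len(ratings[0])       # number of categories
    # Per-item agreement: proportion of agreeing rater pairs
    p_items = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in ratings]
    p_bar = sum(p_items) / n
    # Chance agreement from overall category proportions
    p_cat = [sum(row[j] for row in ratings) / (n * m) for j in range(k)]
    pe = sum(p * p for p in p_cat)
    return (p_bar - pe) / (1 - pe)

# Example: 5 records, 3 screeners, categories (include, exclude)
counts = [[3, 0], [0, 3], [2, 1], [0, 3], [3, 0]]
print(fleiss_kappa(counts))  # ~0.73
```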
Our experienced reviewers provide dual-independent screening with documented agreement rates for your systematic review, following Cochrane and PRISMA best practices.