
Cohen's Kappa Calculator


Calculate inter-rater agreement for screening decisions in your systematic review. Enter a 2×2 agreement matrix or a multi-category matrix to get Cohen's kappa, standard error, 95% CI, and interpretation based on the Landis & Koch scale.

Enter the number of items classified by each rater into the 2×2 agreement matrix. Pre-filled with a screening example (N = 100):

                    Rater 2: Include    Rater 2: Exclude
Rater 1: Include    a = 45              b = 10
Rater 1: Exclude    c = 5               d = 40

κ = (Po − Pe) / (1 − Pe). Po = (a + d) / N. Pe = ((a+b)(a+c) + (c+d)(b+d)) / N².

Results
Cohen's kappa (κ): 0.7000
SE(κ): 0.0711
95% CI: [0.5607, 0.8393]
Observed agreement (Po): 0.8500
Expected agreement (Pe): 0.5000
Total N: 100
Prevalence Index: 0.0500
Bias Index: 0.0500
PABAK: 0.7000
Interpretation: Substantial agreement

How to Use This Tool

1. Choose Mode: Select 2×2 mode for binary screening or multi-category mode for three or more classification categories.

2. Enter Counts: Fill in each cell of the agreement matrix with the number of items the two raters jointly classified into that pair of categories.

3. Review Results: See kappa, SE, 95% CI, observed and expected agreement, and the Landis & Koch interpretation.

4. Copy or Reset: Copy all results to the clipboard for your methods section, or reset to start a new calculation.

Key Takeaways for Inter-Rater Agreement

Kappa vs. percent agreement

Percent agreement (Po) can be misleadingly high when prevalence is extreme. Kappa corrects for chance agreement and provides a more meaningful measure. Always report both Po and kappa in your methods section.

The prevalence paradox affects kappa

When one category dominates (e.g., 95% excluded), Pe is high and kappa can be low despite high percent agreement. PABAK (2 × Po − 1) adjusts for this effect and is recommended as a supplement.

Landis & Koch benchmarks are guidelines

The Landis & Koch scale is widely used but arbitrary. In clinical screening contexts, kappa ≥ 0.60 is often treated as the minimum acceptable value, but the context and the consequences of disagreement should inform your threshold.

Use weighted kappa for ordinal categories

When categories have an inherent order (e.g., Low/Medium/High risk of bias), weighted kappa assigns partial credit for near-agreement. Linear or quadratic weights are the most common choices.
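
As a sketch of how the weighting works, the Python function below computes weighted kappa under either scheme. The three-category risk-of-bias table is a hypothetical example, not data from this page.

    def weighted_kappa(table, scheme="quadratic"):
        """Weighted kappa for ordered categories.
        Disagreement weights: linear = |i - j| / (k - 1),
        quadratic = (|i - j| / (k - 1)) ** 2."""
        k = len(table)
        n = sum(sum(row) for row in table)
        p = [[cell / n for cell in row] for row in table]
        row_m = [sum(row) for row in p]
        col_m = [sum(p[i][j] for i in range(k)) for j in range(k)]
        def w(i, j):
            d = abs(i - j) / (k - 1)
            return d if scheme == "linear" else d * d
        observed = sum(w(i, j) * p[i][j] for i in range(k) for j in range(k))
        expected = sum(w(i, j) * row_m[i] * col_m[j]
                       for i in range(k) for j in range(k))
        return 1 - observed / expected

    # Hypothetical risk-of-bias ratings: Low / Some concerns / High
    rob = [[20, 4, 1],
           [3, 15, 2],
           [0, 2, 13]]
    print(f"{weighted_kappa(rob, 'linear'):.4f}")     # -> 0.7492
    print(f"{weighted_kappa(rob, 'quadratic'):.4f}")  # -> 0.8043

In this example most disagreements are one-step near-misses, which quadratic weights penalise only lightly, so the quadratic value comes out higher than the linear one.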

Agreement Measures in Systematic Reviews

A Cohen's kappa calculator measures the degree of agreement between two raters on categorical classifications, correcting for the proportion of agreement that would occur by chance alone. Cohen (1960) introduced kappa as a reliability coefficient for nominal scales, and it has since become the standard metric for reporting inter-rater agreement in systematic reviews. The Cochrane Handbook for Systematic Reviews of Interventions (Higgins et al., 2023, Chapter 4) recommends that reviewers report agreement statistics for both the screening phase and risk of bias assessment, because independent dual review is a cornerstone of systematic review methodology. When prevalence is extreme and the kappa paradox deflates the statistic, Gwet's AC1 provides a prevalence-robust alternative that yields more stable estimates of true agreement. For reviews involving three or more screeners, Fleiss' kappa extends the two-rater framework to multiple raters, while Krippendorff's alpha generalizes further by accommodating any number of raters, multiple category types, and incomplete (missing) data.

This kappa statistic tool computes observed agreement (Po), expected agreement by chance (Pe), and kappa as (Po − Pe) / (1 − Pe). Interpretation follows the Landis and Koch (1977) scale: kappa of 0.20 or below indicates slight agreement (negative values indicate poor agreement), 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. However, kappa is sensitive to prevalence and bias: when one category dominates (e.g., 95% of screened records are excluded), kappa can be low despite high observed agreement, a phenomenon known as the kappa paradox (Feinstein & Cicchetti, 1990). For this reason, systematic review reporting should include both kappa and the observed agreement percentage alongside the prevalence index.
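
A quick numeric illustration of the paradox, using hypothetical counts in which 95% of records are excluded:

    # Hypothetical screening counts: both include = 1, rater 1 only = 2,
    # rater 2 only = 2, both exclude = 95
    a, b, c, d = 1, 2, 2, 95
    n = a + b + c + d
    po = (a + d) / n                                        # raw agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement
    kappa = (po - pe) / (1 - pe)
    pabak = 2 * po - 1
    print(f"Po = {po:.2f}, Pe = {pe:.4f}, kappa = {kappa:.4f}, PABAK = {pabak:.2f}")
    # -> Po = 0.96, Pe = 0.9418, kappa = 0.3127, PABAK = 0.92

Despite 96% raw agreement, kappa lands only in the "fair" band because Pe is already 0.94.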

In the systematic review workflow, the screening agreement calculator is applied at multiple stages. During title-and-abstract screening, two reviewers independently classify records as "include," "exclude," or "uncertain" — producing a multi-category agreement matrix that this tool handles natively. During full-text eligibility assessment, the same kappa computation quantifies agreement on the final inclusion decision. PRISMA 2020 (Page et al., 2021) expects transparency about the agreement process, including the kappa value, the number of discrepancies, and how disagreements were resolved (discussion, third reviewer adjudication, or consensus). Screening platforms such as Covidence and Rayyan include built-in agreement reporting that automatically computes kappa after dual-independent screening is completed. Conducting calibration exercises and pilot screening on a subset of records before full screening begins is strongly recommended to identify ambiguous criteria and improve agreement before disagreements accumulate at scale.

Beyond screening, kappa is essential for assessing agreement on risk of bias judgments. When two reviewers independently evaluate each study using our RoB 2 assessment tool or our ROBINS-I tool for non-randomized studies, the domain-level judgments (low risk, some concerns, high risk) form an ordinal agreement matrix that kappa can quantify. For continuous-scale reliability — such as agreement on quality scores or extraction of numerical data — the intraclass correlation coefficient (ICC) calculator is the appropriate complement. Together, kappa and ICC provide a complete picture of inter-rater reliability across the categorical and continuous components of a systematic review. When building your extraction form, our data extraction template builder includes built-in fields for recording agreement statistics.

Frequently Asked Questions

What does Cohen’s kappa measure?

Cohen’s kappa (κ) measures the agreement between two raters on categorical classifications, correcting for agreement that would occur by chance alone. Unlike percent agreement (Po), kappa accounts for the expected agreement (Pe) under the assumption of independence. A kappa of 1 means perfect agreement, 0 means agreement equal to chance, and negative values mean agreement worse than chance.

What is the Landis and Koch interpretation scale?

Landis and Koch (1977) proposed a widely used benchmark: κ < 0 = Poor, 0–0.20 = Slight, 0.21–0.40 = Fair, 0.41–0.60 = Moderate, 0.61–0.80 = Substantial, 0.81–1.00 = Almost Perfect. While convenient, these cutoffs are arbitrary and context-dependent. In systematic review screening, κ ≥ 0.60 is generally considered acceptable.
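
In code, the scale reduces to a simple lookup; a minimal Python sketch:

    def landis_koch(kappa):
        """Map a kappa value to the Landis & Koch (1977) label."""
        if kappa < 0:
            return "Poor"
        for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                             (0.80, "Substantial"), (1.00, "Almost Perfect")]:
            if kappa <= upper:
                return label

    print(landis_koch(0.70))  # -> Substantial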

What is the prevalence paradox?

The prevalence paradox occurs when the prevalence of the target category is very high or very low, leading to high percent agreement but a paradoxically low kappa. This happens because Pe becomes large when the marginal distributions are skewed. PABAK (prevalence-adjusted bias-adjusted kappa) corrects for this by fixing Pe at 0.5, its value under balanced marginals, which simplifies the kappa formula to PABAK = 2 × Po − 1.

What alternatives to kappa exist?

Gwet’s AC1 is robust to the prevalence paradox and is recommended when prevalence is extreme. Scott’s pi uses a different chance-agreement model. Krippendorff’s alpha generalizes to multiple raters, multiple categories, and missing data. For ordinal categories, weighted kappa (linear or quadratic weights) accounts for the magnitude of disagreement.
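
As an illustration of the first alternative, here is a minimal two-rater sketch of Gwet's AC1, whose chance model averages the two raters' marginals (Gwet, 2008). Applied to the same extreme-prevalence table used in the paradox discussion above, it stays high where kappa collapses:

    def gwet_ac1(table):
        """Gwet's AC1 for two raters on a k x k agreement matrix."""
        k = len(table)
        n = sum(sum(row) for row in table)
        p = [[cell / n for cell in row] for row in table]
        row_m = [sum(row) for row in p]
        col_m = [sum(p[i][j] for i in range(k)) for j in range(k)]
        po = sum(p[i][i] for i in range(k))
        pi = [(row_m[i] + col_m[i]) / 2 for i in range(k)]   # averaged marginals
        pe = sum(q * (1 - q) for q in pi) / (k - 1)          # AC1 chance agreement
        return (po - pe) / (1 - pe)

    print(f"{gwet_ac1([[1, 2], [2, 95]]):.4f}")  # -> 0.9575 (kappa = 0.3127 here)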

When should I report kappa in a systematic review?

The Cochrane Handbook and PRISMA guidelines recommend reporting inter-rater agreement for both title/abstract screening and full-text screening. You should report kappa (or an alternative) along with percent agreement. If kappa is low, describe the resolution process (e.g., discussion, third reviewer). This helps readers assess the reliability of study selection.

What is a good kappa score for systematic review screening?

For systematic review screening, κ ≥ 0.60 (substantial agreement) is generally considered the minimum acceptable threshold, though κ ≥ 0.80 (almost perfect) is preferred. If kappa falls below 0.60 after pilot screening, reviewers should recalibrate by discussing discrepancies, clarifying eligibility criteria, and re-screening a sample before proceeding with full screening.

Why is my kappa low when percent agreement is high?

This is the kappa paradox (prevalence paradox). When one category dominates (for example, when 95% of records are excluded), the expected agreement by chance (Pe) is very high, leaving little room for observed agreement to exceed chance. PABAK (prevalence-adjusted bias-adjusted kappa) corrects for this effect: PABAK = 2 × Po − 1. Report both kappa and percent agreement for transparency.

What is the difference between Cohen’s kappa and Fleiss’ kappa?

Cohen’s kappa measures agreement between exactly two raters on categorical data. Fleiss’ kappa extends this to three or more raters, where each item is rated by a fixed number of randomly assigned raters. For systematic reviews with more than two screeners, Fleiss’ kappa is the appropriate statistic. Both correct for chance agreement.
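
A minimal sketch of Fleiss' kappa, using a hypothetical example of three screeners voting include/exclude on five records (each row gives the number of raters choosing each category):

    def fleiss_kappa(ratings):
        """Fleiss' kappa. ratings[i][j] = raters assigning item i to category j;
        every item must be rated by the same number of raters."""
        n_items, n_raters = len(ratings), sum(ratings[0])
        k = len(ratings[0])
        # Overall proportion of assignments falling in each category
        p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
               for j in range(k)]
        # Per-item observed agreement among rater pairs
        p_i = [sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
               for row in ratings]
        p_bar = sum(p_i) / n_items
        pe = sum(q * q for q in p_j)
        return (p_bar - pe) / (1 - pe)

    # Hypothetical: 3 screeners x 5 records, categories = [include, exclude]
    votes = [[3, 0], [0, 3], [1, 2], [0, 3], [2, 1]]
    print(f"{fleiss_kappa(votes):.4f}")  # -> 0.4444 (moderate)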

Need Expert Dual Screening?

Our experienced reviewers provide dual-independent screening with documented agreement rates for your systematic review, following Cochrane and PRISMA best practices.

Explore Services · View Pricing