Calculate inter-rater agreement for screening decisions in your systematic review. Enter a 2×2 agreement matrix or a multi-category matrix to get Cohen's kappa, standard error, 95% CI, and interpretation based on the Landis & Koch scale.
Enter the number of items classified by each rater into the 2×2 agreement matrix. Pre-filled with a screening example (N = 100).
κ = (Po − Pe) / (1 − Pe), where Po = (a + d) / N and Pe = ((a+b)(a+c) + (c+d)(b+d)) / N². Here a and d are the agreement cells (e.g., both include / both exclude) and b, c are the discrepant cells.
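For readers who want to check the arithmetic outside the calculator, here is a minimal Python sketch of these formulas. The function name is illustrative, and the SE uses the common large-sample approximation √(Po(1 − Po) / (N(1 − Pe)²)); the calculator may use a different SE variant.

```python
import math

def kappa_2x2(a: int, b: int, c: int, d: int) -> dict:
    """Cohen's kappa for a 2x2 table: a, d = agreement cells, b, c = discrepancies."""
    n = a + b + c + d
    po = (a + d) / n                                       # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))    # approximate large-sample SE
    return {
        "kappa": kappa,
        "se": se,
        "ci95": (kappa - 1.96 * se, kappa + 1.96 * se),
        "po": po,
        "pe": pe,
    }

# Example: 40 dual includes, 5 + 5 discrepancies, 50 dual excludes (N = 100)
print(kappa_2x2(40, 5, 5, 50))  # kappa ~0.80, 95% CI ~(0.68, 0.92)
```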
Select 2×2 mode for binary screening or multi-category mode for 3+ classification categories.
Fill in the agreement matrix cells with the number of items assigned to each combination of the two raters' categories (a code sketch of the multi-category computation follows these steps).
See kappa, SE, 95% CI, observed and expected agreement, and the Landis & Koch interpretation.
Copy all results to clipboard for your methods section, or reset to start a new calculation.
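As referenced in the steps above, here is a sketch of the multi-category computation, assuming a square confusion matrix with rows as rater 1's categories and columns as rater 2's; the helper name is illustrative.

```python
def kappa_multi(matrix: list[list[int]]) -> float:
    """Cohen's kappa for a k x k confusion matrix of two raters' counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(k)) / n                   # diagonal = agreements
    row = [sum(matrix[i]) for i in range(k)]                       # rater 1 marginals
    col = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals
    pe = sum(row[i] * col[i] for i in range(k)) / n ** 2
    return (po - pe) / (1 - pe)

# Example: include / exclude / uncertain screening decisions (N = 100)
decisions = [
    [30, 4, 2],   # rater 1: include
    [3, 45, 5],   # rater 1: exclude
    [2, 3, 6],    # rater 1: uncertain
]
print(kappa_multi(decisions))  # ~0.67 (substantial)
```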
Percent agreement (Po) can be misleadingly high when prevalence is extreme. Kappa corrects for chance agreement and provides a more meaningful measure. Always report both Po and kappa in your methods section.
When one category dominates (e.g., 95% excluded), Pe is high and kappa can be low despite high percent agreement. PABAK (2 × Po − 1) adjusts for this effect and is recommended as a supplement.
The Landis & Koch scale is widely used but arbitrary. In clinical screening contexts, kappa ≥ 0.60 is often the minimum acceptable threshold. Context and consequences of disagreement should inform your threshold.
When categories have an inherent order (e.g., Low/Medium/High risk of bias), weighted kappa assigns partial credit for near-agreement. Linear or quadratic weights are the most common choices.
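A sketch of weighted kappa under linear or quadratic weights, assuming categories are listed in their natural order; the function and parameter names are illustrative.

```python
def weighted_kappa(matrix: list[list[int]], weights: str = "linear") -> float:
    """Weighted kappa for ordered categories indexed 0..k-1."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row = [sum(matrix[i]) for i in range(k)]
    col = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    def w(i: int, j: int) -> float:
        d = abs(i - j) / (k - 1)                  # normalized distance between categories
        return 1 - (d * d if weights == "quadratic" else d)

    po = sum(w(i, j) * matrix[i][j] for i in range(k) for j in range(k)) / n
    pe = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k)) / n ** 2
    return (po - pe) / (1 - pe)

# Example: Low / Medium / High risk-of-bias judgments
rob = [[20, 5, 0], [4, 15, 6], [1, 3, 16]]
print(weighted_kappa(rob, "quadratic"))
```

Quadratic weights penalize a Low/High disagreement four times as heavily as a Low/Medium one, which is why they are a common default for ordinal scales.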
A Cohen's kappa calculator measures the degree of agreement between two raters on categorical classifications, correcting for the proportion of agreement that would occur by chance alone. Cohen (1960) introduced kappa as a reliability coefficient for nominal scales, and it has since become the standard metric for reporting inter-rater agreement in systematic reviews. The Cochrane Handbook for Systematic Reviews of Interventions (Higgins et al., 2023, Chapter 4) recommends that reviewers report agreement statistics for both the screening phase and risk of bias assessment, because independent dual review is a cornerstone of systematic review methodology. When prevalence is extreme and the kappa paradox deflates the statistic, Gwet's AC1 provides a prevalence-robust alternative that yields more stable estimates of agreement. For reviews involving three or more screeners, Fleiss' kappa extends the two-rater framework to multiple raters, while Krippendorff's alpha generalizes further by accommodating any number of raters, multiple category types, and incomplete (missing) data.
This kappa statistic tool computes observed agreement (Po), expected agreement by chance (Pe), and kappa as (Po − Pe) / (1 − Pe). Interpretation follows the Landis and Koch (1977) scale: values below 0 indicate poor agreement, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect. However, kappa is sensitive to prevalence and bias: when one category dominates (e.g., 95% of screened records are excluded), kappa can be low despite high observed agreement, a phenomenon known as the kappa paradox (Feinstein & Cicchetti, 1990). For this reason, systematic review reporting should include both kappa and the observed agreement percentage alongside the prevalence index.
In the systematic review workflow, the screening agreement calculator is applied at multiple stages. During title-and-abstract screening, two reviewers independently classify records as "include," "exclude," or "uncertain," producing a multi-category agreement matrix that this tool handles natively. During full-text eligibility assessment, the same kappa computation quantifies agreement on the final inclusion decision. PRISMA 2020 (Page et al., 2021) calls for transparent reporting of the selection process, which in practice means reporting the kappa value, the number of discrepancies, and how disagreements were resolved (discussion, third-reviewer adjudication, or consensus). Screening platforms such as Covidence and Rayyan include built-in agreement reporting for dual-independent screening. Conducting calibration exercises and pilot screening on a subset of records before full screening begins is strongly recommended: it surfaces ambiguous criteria and improves agreement before disagreements accumulate at scale.
Beyond screening, kappa is essential for assessing agreement on risk of bias judgments. When two reviewers independently evaluate each study using our RoB 2 assessment tool or our ROBINS-I tool for non-randomized studies, the domain-level judgments (low risk, some concerns, high risk) form an ordinal agreement matrix that kappa can quantify. For continuous-scale reliability — such as agreement on quality scores or extraction of numerical data — the intraclass correlation coefficient (ICC) calculator is the appropriate complement. Together, kappa and ICC provide a complete picture of inter-rater reliability across the categorical and continuous components of a systematic review. When building your extraction form, our data extraction template builder includes built-in fields for recording agreement statistics.
Cohen’s kappa (κ) measures the agreement between two raters on categorical classifications, correcting for agreement that would occur by chance alone. Unlike percent agreement (Po), kappa accounts for the expected agreement (Pe) under the assumption of independence. A kappa of 1 means perfect agreement, 0 means agreement equal to chance, and negative values mean agreement worse than chance.
Landis and Koch (1977) proposed a widely used benchmark: κ < 0 = Poor, 0–0.20 = Slight, 0.21–0.40 = Fair, 0.41–0.60 = Moderate, 0.61–0.80 = Substantial, 0.81–1.00 = Almost Perfect. While convenient, these cutoffs are arbitrary and context-dependent. In systematic review screening, κ ≥ 0.60 is generally considered acceptable.
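If you script your reporting, the scale above maps directly to code; a minimal sketch (the function name is illustrative):

```python
def landis_koch(kappa: float) -> str:
    """Map a kappa value to the Landis & Koch (1977) label."""
    if kappa < 0:
        return "Poor"
    for cutoff, label in [(0.20, "Slight"), (0.40, "Fair"),
                          (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= cutoff:
            return label
    return "Almost Perfect"

print(landis_koch(0.67))  # Substantial
```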
The prevalence paradox occurs when the prevalence of the target category is very high or very low, producing high percent agreement but a paradoxically low kappa. This happens because Pe becomes large when the marginal distributions are skewed. PABAK (prevalence-adjusted bias-adjusted kappa) corrects for this by assuming balanced marginals, i.e., 50% prevalence and no rater bias: PABAK = 2 × Po − 1.
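A small numeric illustration of the paradox, applying the same 2×2 formulas to a heavily skewed table (the counts are made up for illustration):

```python
# 2 dual includes, 2 + 2 discrepancies, 94 dual excludes (N = 100)
a, b, c, d = 2, 2, 2, 94
n = a + b + c + d
po = (a + d) / n                                       # 0.96 percent agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # ~0.92 under skewed marginals
kappa = (po - pe) / (1 - pe)                           # ~0.48 despite Po = 0.96
pabak = 2 * po - 1                                     # 0.92, tracks raw agreement
print(round(po, 2), round(kappa, 2), round(pabak, 2))
```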
Gwet’s AC1 is robust to the prevalence paradox and is recommended when prevalence is extreme. Scott’s pi uses a different chance-agreement model, based on the raters' averaged marginal proportions. Krippendorff’s alpha generalizes to multiple raters, multiple categories, and missing data. For ordinal categories, weighted kappa (linear or quadratic weights) accounts for the magnitude of disagreement.
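A sketch of Gwet's AC1 under the same square-matrix convention as the earlier examples. The chance term, Pe = (1/(k − 1)) Σ πk(1 − πk) with πk the mean of the two raters' marginal proportions for category k, follows Gwet's definition; the helper name is illustrative.

```python
def gwet_ac1(matrix: list[list[int]]) -> float:
    """Gwet's AC1 for a k x k confusion matrix of two raters' counts."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    po = sum(matrix[i][i] for i in range(k)) / n
    # Mean marginal proportion per category across both raters
    pi = [(sum(matrix[i]) + sum(matrix[j][i] for j in range(k))) / (2 * n)
          for i in range(k)]
    pe = sum(p * (1 - p) for p in pi) / (k - 1)
    return (po - pe) / (1 - pe)

# Same skewed table as the paradox example: kappa ~0.48, but AC1 stays high
print(gwet_ac1([[2, 2], [2, 94]]))  # ~0.96
```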
The Cochrane Handbook and PRISMA guidelines recommend reporting inter-rater agreement for both title/abstract screening and full-text screening. You should report kappa (or an alternative) along with percent agreement. If kappa is low, describe the resolution process (e.g., discussion, third reviewer). This helps readers assess the reliability of study selection.
For systematic review screening, κ ≥ 0.60 (substantial agreement) is generally considered the minimum acceptable threshold, though κ ≥ 0.80 (almost perfect) is preferred. If kappa falls below 0.60 after pilot screening, reviewers should recalibrate by discussing discrepancies, clarifying eligibility criteria, and re-screening a sample before proceeding with full screening.
This is the kappa paradox (prevalence paradox). When one category dominates (for example, 95% of records are excluded), the expected agreement by chance (Pe) is very high, leaving little room for kappa to exceed zero. PABAK (prevalence-adjusted bias-adjusted kappa) corrects for this effect: PABAK = 2 × Po − 1. Report both kappa and percent agreement for transparency.
Cohen’s kappa measures agreement between exactly two raters on categorical data. Fleiss’ kappa extends this to three or more raters, where each item is rated by a fixed number of randomly assigned raters. For systematic reviews with more than two screeners, Fleiss’ kappa is the appropriate statistic. Both correct for chance agreement.
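A minimal sketch of Fleiss' kappa, assuming ratings[i][j] counts how many of the m raters placed item i in category j, with every item rated by the same number of raters; the names are illustrative.

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for n items x k categories, m raters per item."""
    n = len(ratings)          # number of items
    m = sum(ratings[0])       # raters per item (constant across items)
    k = len(ratings[0])       # number of categories
    # Per-item agreement: proportion of agreeing rater pairs
    p_items = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in ratings]
    p_bar = sum(p_items) / n
    # Chance agreement from overall category proportions
    p_cat = [sum(row[j] for row in ratings) / (n * m) for j in range(k)]
    pe = sum(p * p for p in p_cat)
    return (p_bar - pe) / (1 - pe)

# Example: 5 records, 3 screeners, categories (include, exclude)
counts = [[3, 0], [0, 3], [2, 1], [0, 3], [3, 0]]
print(fleiss_kappa(counts))  # ~0.73
```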
Our experienced reviewers provide dual-independent screening with documented agreement rates for your systematic review, following Cochrane and PRISMA best practices.