Calculate intraclass correlation coefficients for inter-rater reliability. Computes all six ICC forms — ICC(1,1), ICC(2,1), ICC(3,1), ICC(1,k), ICC(2,k), and ICC(3,k) — with ANOVA decomposition and 95% confidence intervals.
| Subject | Rater 1 | Rater 2 | Rater 3 |
|---|---|---|---|
| S1 | |||
| S2 | |||
| S3 | |||
| S4 | |||
| S5 | |||
| S6 | |||
| S7 | |||
| S8 | |||
| S9 | |||
| S10 | |||
Enter numeric ratings for each subject-rater pair. All cells in a row must be filled for that subject to be included. Paste tab- or comma-separated data from a spreadsheet using the Paste Data button.
Specify the number of raters (2–10) and subjects (2–50). The data grid adjusts automatically to match your study design.
Fill in numeric ratings for each subject-rater pair. You can type values directly or use Paste Data to import tab- or comma-separated data from a spreadsheet.
See ICC values, 95% confidence intervals, and Koo & Li (2016) interpretation badges for all six ICC forms. ICC(2,1) and ICC(3,1) are highlighted as the most commonly used.
Copy the complete results including ANOVA components (MSB, MSJ, MSE) to your clipboard for your reliability analysis report or manuscript.
ICC(1), ICC(2), and ICC(3) differ fundamentally in their assumptions about raters. One-way models (ICC(1)) assume each subject may be rated by a different set of raters. Two-way random models (ICC(2)) assume the raters are a random sample from a larger population. Two-way mixed models (ICC(3)) treat the raters as fixed. Choosing the wrong model can substantially inflate or deflate your reliability estimate.
Consistency (ICC(3)) only asks whether raters rank subjects similarly, while absolute agreement (ICC(2)) also penalizes systematic rater differences. If Rater A always scores 2 points higher than Rater B, the consistency ICC will be high but the agreement ICC will be lower. Use absolute agreement when raters must be interchangeable.
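This effect can be sketched in a few lines of plain Python (an illustration of the Shrout & Fleiss single-measure formulas with made-up data, not this calculator's own implementation). Rater B always scores 2 points above Rater A, so consistency is perfect while agreement drops:

```python
# Hypothetical data: Rater B systematically scores 2 points above Rater A.
a = [1, 2, 3, 4, 5, 6]
b = [x + 2 for x in a]
rows = list(zip(a, b))          # subjects x raters matrix
n, k = len(rows), 2

grand = sum(a + b) / (n * k)
row_means = [sum(r) / k for r in rows]
col_means = [sum(c) / n for c in zip(*rows)]

ss_total = sum((x - grand) ** 2 for r in rows for x in r)
ss_rows = k * sum((m - grand) ** 2 for m in row_means)
ss_cols = n * sum((m - grand) ** 2 for m in col_means)
ss_err = ss_total - ss_rows - ss_cols

msb = ss_rows / (n - 1)              # between-subjects mean square
msj = ss_cols / (k - 1)              # between-raters mean square
mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

# Shrout & Fleiss (1979) single-measure forms:
icc31 = (msb - mse) / (msb + (k - 1) * mse)                        # consistency
icc21 = (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n)  # agreement

print(f"ICC(3,1) = {icc31:.3f}")   # 1.000 -- ranks preserved perfectly
print(f"ICC(2,1) = {icc21:.3f}")   # 0.636 -- penalized for the +2 offset
```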
The most widely cited interpretation framework classifies ICC < 0.50 as poor, 0.50–0.75 as moderate, 0.75–0.90 as good, and > 0.90 as excellent. These benchmarks apply to single-measure forms. Always report the 95% CI: a point estimate of 0.85 with a CI of [0.40, 0.95] indicates poor precision despite a nominally good value.
State the ICC form used (e.g., ICC(3,1) for single-measure consistency), the number of raters and subjects, the point estimate, the 95% confidence interval, and the interpretation framework. Also describe rater training, rating conditions, and any blinding procedures. Guidelines like GRRAS (Kottner et al., 2011) provide a comprehensive reporting checklist.
An online ICC calculator computes the intraclass correlation coefficient, the standard measure of agreement between two or more raters on continuous or ordinal measurements. Unlike Pearson's correlation, which only assesses the linear relationship between two variables, the ICC evaluates both consistency and absolute agreement — accounting for systematic differences between raters. Shrout and Fleiss (1979) defined six ICC forms that remain the foundation of reliability research: ICC(1,1) for one-way random single measures, ICC(2,1) for two-way random single measures, ICC(3,1) for two-way mixed single measures, and their averaged counterparts ICC(1,k), ICC(2,k), and ICC(3,k). Choosing the correct form depends on whether raters are considered random or fixed effects and whether you need single-measure or average-measure reliability. ICC computation is available in SPSS (via the Reliability Analysis module) and in R through the irr package, both of which serve as common reference implementations for validating results. For ordinal categorical data where ICC is not appropriate, weighted kappa serves as an analogous agreement measure that assigns partial credit for near-agreement between ordered categories.
This inter-rater reliability calculator is essential for systematic review methodology. The Cochrane Handbook (Higgins et al., 2023, Chapter 4) requires that systematic reviews report inter-rater agreement for both study screening and data extraction. When two reviewers independently extract continuous data — such as means, standard deviations, sample sizes, or quality scores — the ICC quantifies the degree of agreement between their extracted values. An ICC above 0.75 indicates good reliability, while values above 0.90 indicate excellent reliability (Koo & Li, 2016). Values below 0.50 suggest poor agreement that requires additional calibration training before proceeding with extraction. The GRRAS reporting guideline (Kottner et al., 2011) provides a comprehensive checklist for reliability studies, specifying that authors must report the ICC form, the number of raters and subjects, rater training procedures, and blinding conditions. Walter et al. (1998) recommend a minimum of 30 subjects for stable ICC estimates with acceptably narrow confidence intervals.
The intraclass correlation calculator in this tool computes ICC from a rater-by-subject data matrix using ANOVA decomposition. The between-subjects mean square (MSB), between-raters mean square (MSJ), and residual error mean square (MSE) partition the total variance into components attributable to true differences between subjects, systematic rater bias, and random measurement error. ICC(2,1) — the most commonly reported form in medical research — accounts for both random error and systematic rater effects, making it the recommended choice when raters are a random sample from a larger population of potential raters. Bland–Altman plots complement ICC by providing a visual assessment of agreement: plotting the difference between two raters against their mean reveals systematic bias and identifies whether disagreement varies across the measurement range, information that a single ICC value cannot capture.
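The decomposition described above can be sketched as a small plain-Python function (an illustration of the Shrout & Fleiss formulas, not this calculator's actual source code). The example matrix is the 6-subject, 4-rater dataset from Shrout & Fleiss (1979):

```python
def icc_forms(data):
    """All six Shrout & Fleiss (1979) ICC forms from a subjects x raters matrix."""
    n, k = len(data), len(data[0])
    grand = sum(x for row in data for x in row) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(col) / n for col in zip(*data)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    msb = ss_rows / (n - 1)                    # between-subjects (MSB)
    msj = ss_cols / (k - 1)                    # between-raters (MSJ)
    mse = ss_err / ((n - 1) * (k - 1))         # residual error (MSE)
    msw = (ss_cols + ss_err) / (n * (k - 1))   # within-subjects (one-way model)

    return {
        "ICC(1,1)": (msb - msw) / (msb + (k - 1) * msw),
        "ICC(2,1)": (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n),
        "ICC(3,1)": (msb - mse) / (msb + (k - 1) * mse),
        "ICC(1,k)": (msb - msw) / msb,
        "ICC(2,k)": (msb - mse) / (msb + (msj - mse) / n),
        "ICC(3,k)": (msb - mse) / msb,
    }

# Worked example from Shrout & Fleiss (1979): 6 subjects, 4 raters.
# Reproduces the published values, e.g. ICC(2,1) ~ 0.29 and ICC(3,1) ~ 0.71.
ratings = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
for form, value in icc_forms(ratings).items():
    print(f"{form}: {value:.3f}")
```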
In practice, ICC serves multiple purposes across the systematic review workflow. During pilot testing of the data extraction form, ICC identifies fields where reviewers disagree — poor agreement on a particular variable may indicate ambiguous extraction instructions. For categorical screening decisions (include/exclude/uncertain), use our Cohen's kappa calculator instead, as kappa is designed for nominal classifications. When extracting numerical study data, structure your form using our data extraction template builder to standardize variable definitions. For planning how many subjects to include in a reliability study, our power analysis calculator estimates the sample size needed to detect a target ICC with adequate precision. Report your ICC alongside its 95% confidence interval in your methods section, specifying the ICC form used and the interpretation framework applied.
The ICC quantifies the degree of agreement or consistency among measurements made by multiple raters (or instruments) on the same set of subjects. Unlike Pearson correlation, which measures linear association between two variables, ICC accounts for both the correlation and the agreement between raters. ICC values typically range from 0 (no reliability) to 1 (perfect reliability), though negative estimates can occur when between-subjects variability is lower than within-subjects variability.
The choice depends on your study design and research question. Use ICC(1,1) or ICC(1,k) when each subject is rated by a different random set of raters (one-way random model). Use ICC(2,1) or ICC(2,k) when the same random raters rate all subjects and you need absolute agreement (two-way random model). Use ICC(3,1) or ICC(3,k) when the same fixed raters rate all subjects and you only need consistency (two-way mixed model). Use single-measure forms (ending in ,1) when reporting the reliability of an individual rater, and average-measure forms (ending in ,k) when the final measurement will be the mean of k raters.
ICC is designed for continuous or ordinal data with multiple raters, while Cohen’s kappa is designed for categorical (nominal) data between exactly two raters. ICC uses an ANOVA-based framework to partition variance, while kappa uses observed vs. expected agreement proportions. For ordinal data, weighted kappa and ICC may give similar results, but ICC is generally preferred when ratings are on a continuous or interval scale. For multi-rater categorical data, Fleiss’ kappa extends Cohen’s kappa.
Koo and Li (2016) proposed widely cited guidelines: ICC < 0.50 indicates poor reliability, 0.50–0.75 indicates moderate reliability, 0.75–0.90 indicates good reliability, and > 0.90 indicates excellent reliability. However, these are general benchmarks. The acceptable ICC depends on your field and purpose. Diagnostic instruments typically require ICC > 0.90, while screening tools may accept ICC > 0.70. Always report the 95% confidence interval alongside the point estimate.
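The Koo & Li benchmarks reduce to a simple lookup. A minimal sketch (boundary values are assigned to the higher category here, a convention Koo and Li leave ambiguous):

```python
def interpret_icc(icc):
    """Koo & Li (2016) benchmark label for an ICC point estimate."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(interpret_icc(0.82))  # good
```

Remember that the label applies to the point estimate only; inspect the confidence interval before drawing conclusions.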
As a general guideline, at least 30 subjects and 3 raters are recommended for stable ICC estimates, though more is always better. With fewer than 10 subjects, ICC confidence intervals become very wide and the estimate is unreliable. The number of raters affects precision: more raters reduce the width of the confidence interval. For pilot studies, 15–20 subjects with 2–3 raters may suffice, but for definitive reliability studies, aim for 50+ subjects. Walter, Eliasziw, and Donner (1998) provide formal sample size formulas for ICC studies.
Cronbach’s alpha measures internal consistency (how well items on a scale measure the same construct), while ICC measures agreement between different raters or instruments measuring the same subjects. ICC(3,k) for consistency is mathematically identical to Cronbach’s alpha, but the interpretations differ: alpha assesses scale reliability, ICC assesses measurement reliability.
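The identity can be verified numerically. This sketch (made-up ratings for 5 subjects and 2 raters) computes Cronbach's alpha directly, then ICC(3,k) from the ANOVA mean squares, and the two agree:

```python
# Numeric check: Cronbach's alpha equals the consistency form ICC(3,k).
data = [[1, 2], [2, 4], [3, 3], [4, 6], [5, 7]]   # hypothetical ratings
n, k = len(data), len(data[0])

def var(xs):                       # sample variance (ddof = 1)
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Cronbach's alpha, treating raters as "items"
totals = [sum(row) for row in data]
alpha = k / (k - 1) * (1 - sum(var(col) for col in zip(*data)) / var(totals))

# ICC(3,k) via the two-way ANOVA mean squares
grand = sum(totals) / (n * k)
ss_rows = k * sum((sum(row) / k - grand) ** 2 for row in data)
ss_cols = n * sum((sum(c) / n - grand) ** 2 for c in zip(*data))
ss_total = sum((x - grand) ** 2 for row in data for x in row)
msb = ss_rows / (n - 1)
mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
icc3k = (msb - mse) / msb

print(alpha, icc3k)   # both ~ 0.9375
```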
For most inter-rater reliability studies, use ICC(2,1) if raters are a random sample from a larger population and you need absolute agreement, or ICC(3,1) if the same fixed raters rate all subjects and you only need consistency. Use the averaged forms (ICC(2,k) or ICC(3,k)) only when the final measurement will be the mean of k raters.
Report the ICC form used (e.g., ICC(2,1), two-way random, absolute agreement), the number of raters and subjects, the point estimate, and the 95% confidence interval. Example: “Inter-rater reliability was good, ICC(2,1) = 0.82, 95% CI [0.71, 0.89], based on 3 raters and 40 subjects.” Follow the GRRAS guideline (Kottner et al., 2011) for complete reporting.
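For scripted reports, a trivial helper (illustrative only; names and argument order are our own) can assemble the sentence above from its pieces:

```python
def icc_report(form, design, label, est, lo, hi, raters, subjects):
    """Format an ICC result sentence in the style recommended above."""
    return (f"Inter-rater reliability was {label}, {form} = {est:.2f}, "
            f"95% CI [{lo:.2f}, {hi:.2f}] ({design}), "
            f"based on {raters} raters and {subjects} subjects.")

print(icc_report("ICC(2,1)", "two-way random, absolute agreement",
                 "good", 0.82, 0.71, 0.89, 3, 40))
```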
Our biostatisticians can conduct full inter-rater reliability analyses, design calibration exercises, compute weighted kappa and ICC for your screening team, and ensure your systematic review meets PRISMA reporting standards.