Calculate intraclass correlation coefficients for inter-rater reliability. Computes all six ICC forms — ICC(1,1), ICC(2,1), ICC(3,1), ICC(1,k), ICC(2,k), and ICC(3,k) — with ANOVA decomposition and 95% confidence intervals.
| Subject | Rater 1 | Rater 2 | Rater 3 |
|---|---|---|---|
| S1 | |||
| S2 | |||
| S3 | |||
| S4 | |||
| S5 | |||
| S6 | |||
| S7 | |||
| S8 | |||
| S9 | |||
| S10 | |||
Enter numeric ratings for each subject-rater pair. All cells in a row must be filled for that subject to be included. Paste tab- or comma-separated data from a spreadsheet using the Paste Data button.
Specify the number of raters (2–10) and subjects (2–50). The data grid adjusts automatically to match your study design.
Fill in numeric ratings for each subject-rater pair. You can type values directly or use Paste Data to import tab- or comma-separated data from a spreadsheet.
See ICC values, 95% confidence intervals, and Koo & Li (2016) interpretation badges for all six ICC forms. ICC(2,1) and ICC(3,1) are highlighted as the most commonly used.
Copy the complete results including ANOVA components (MSB, MSJ, MSE) to your clipboard for your reliability analysis report or manuscript.
ICC(1), ICC(2), and ICC(3) differ fundamentally in their assumptions about raters. One-way models (ICC(1)) assume each subject may be rated by a different set of raters. Two-way random models (ICC(2)) assume the raters are a random sample from a larger population. Two-way mixed models (ICC(3)) treat the raters as fixed. Choosing the wrong model can substantially inflate or deflate your reliability estimate.
Consistency (ICC(3)) only asks whether raters rank subjects similarly, while absolute agreement (ICC(2)) also penalizes systematic rater differences. If Rater A always scores 2 points higher than Rater B, the consistency ICC will be high but the agreement ICC will be lower. Use absolute agreement when raters must be interchangeable.
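This effect can be sketched in a few lines of plain Python (an illustration of the Shrout & Fleiss single-measure formulas with made-up data, not this calculator's own implementation). Rater B always scores 2 points above Rater A, so consistency is perfect while agreement drops:

```python
# Hypothetical data: Rater B systematically scores 2 points above Rater A.
a = [1, 2, 3, 4, 5, 6]
b = [x + 2 for x in a]
rows = list(zip(a, b))          # subjects x raters matrix
n, k = len(rows), 2

grand = sum(a + b) / (n * k)
row_means = [sum(r) / k for r in rows]
col_means = [sum(c) / n for c in zip(*rows)]

ss_total = sum((x - grand) ** 2 for r in rows for x in r)
ss_rows = k * sum((m - grand) ** 2 for m in row_means)
ss_cols = n * sum((m - grand) ** 2 for m in col_means)
ss_err = ss_total - ss_rows - ss_cols

msb = ss_rows / (n - 1)              # between-subjects mean square
msj = ss_cols / (k - 1)              # between-raters mean square
mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

# Shrout & Fleiss (1979) single-measure forms:
icc31 = (msb - mse) / (msb + (k - 1) * mse)                        # consistency
icc21 = (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n)  # agreement

print(f"ICC(3,1) = {icc31:.3f}")   # 1.000 -- ranks preserved perfectly
print(f"ICC(2,1) = {icc21:.3f}")   # 0.636 -- penalized for the +2 offset
```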
The most widely cited interpretation framework classifies ICC < 0.50 as poor, 0.50–0.75 as moderate, 0.75–0.90 as good, and > 0.90 as excellent. These benchmarks apply to single-measure forms. Always report the 95% CI: a point estimate of 0.85 with a CI of [0.40, 0.95] indicates poor precision despite a nominally good value.
State the ICC form used (e.g., ICC(3,1) for single-measure consistency), the number of raters and subjects, the point estimate, the 95% confidence interval, and the interpretation framework. Also describe rater training, rating conditions, and any blinding procedures. Guidelines like GRRAS (Kottner et al., 2011) provide a comprehensive reporting checklist.
An online ICC calculator computes the intraclass correlation coefficient, the standard measure of agreement between two or more raters on continuous or ordinal measurements. Unlike Pearson's correlation, which only assesses the linear relationship between two variables, the ICC evaluates both consistency and absolute agreement — accounting for systematic differences between raters. Shrout and Fleiss (1979) defined six ICC forms that remain the foundation of reliability research: ICC(1,1) for one-way random single measures, ICC(2,1) for two-way random single measures, ICC(3,1) for two-way mixed single measures, and their averaged counterparts ICC(1,k), ICC(2,k), and ICC(3,k). Choosing the correct form depends on whether raters are considered random or fixed effects and whether you need single-measure or average-measure reliability. ICC computation is available in SPSS (via the Reliability Analysis module) and in R through the irr package, both of which serve as common reference implementations for validating results. For ordinal categorical data where ICC is not appropriate, weighted kappa serves as an analogous agreement measure that assigns partial credit for near-agreement between ordered categories.
This inter-rater reliability calculator is essential for systematic review methodology. The Cochrane Handbook (Higgins et al., 2023, Chapter 4) requires that systematic reviews report inter-rater agreement for both study screening and data extraction. When two reviewers independently extract continuous data — such as means, standard deviations, sample sizes, or quality scores — the ICC quantifies the degree of agreement between their extracted values. An ICC above 0.75 indicates good reliability, while values above 0.90 indicate excellent reliability (Koo & Li, 2016). Values below 0.50 suggest poor agreement that requires additional calibration training before proceeding with extraction. The GRRAS reporting guideline (Kottner et al., 2011) provides a comprehensive checklist for reliability studies, specifying that authors must report the ICC form, the number of raters and subjects, rater training procedures, and blinding conditions. Walter et al. (1998) recommend a minimum of 30 subjects for stable ICC estimates with acceptably narrow confidence intervals.
The intraclass correlation calculator in this tool computes ICC from a rater-by-subject data matrix using ANOVA decomposition. The between-subjects mean square (MSB), between-raters mean square (MSJ), and residual error mean square (MSE) partition the total variance into components attributable to true differences between subjects, systematic rater bias, and random measurement error. ICC(2,1) — the most commonly reported form in medical research — accounts for both random error and systematic rater effects, making it the recommended choice when raters are a random sample from a larger population of potential raters. Bland–Altman plots complement ICC by providing a visual assessment of agreement: plotting the difference between two raters against their mean reveals systematic bias and identifies whether disagreement varies across the measurement range, information that a single ICC value cannot capture.
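The decomposition described above can be sketched as a small plain-Python function (an illustration of the Shrout & Fleiss formulas, not this calculator's actual source code). The example matrix is the 6-subject, 4-rater dataset from Shrout & Fleiss (1979):

```python
def icc_forms(data):
    """All six Shrout & Fleiss (1979) ICC forms from a subjects x raters matrix."""
    n, k = len(data), len(data[0])
    grand = sum(x for row in data for x in row) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(col) / n for col in zip(*data)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    msb = ss_rows / (n - 1)                    # between-subjects (MSB)
    msj = ss_cols / (k - 1)                    # between-raters (MSJ)
    mse = ss_err / ((n - 1) * (k - 1))         # residual error (MSE)
    msw = (ss_cols + ss_err) / (n * (k - 1))   # within-subjects (one-way model)

    return {
        "ICC(1,1)": (msb - msw) / (msb + (k - 1) * msw),
        "ICC(2,1)": (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n),
        "ICC(3,1)": (msb - mse) / (msb + (k - 1) * mse),
        "ICC(1,k)": (msb - msw) / msb,
        "ICC(2,k)": (msb - mse) / (msb + (msj - mse) / n),
        "ICC(3,k)": (msb - mse) / msb,
    }

# Worked example from Shrout & Fleiss (1979): 6 subjects, 4 raters.
# Reproduces the published values, e.g. ICC(2,1) ~ 0.29 and ICC(3,1) ~ 0.71.
ratings = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
for form, value in icc_forms(ratings).items():
    print(f"{form}: {value:.3f}")
```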
In practice, ICC serves multiple purposes across the systematic review workflow. During pilot testing of the data extraction form, ICC identifies fields where reviewers disagree — poor agreement on a particular variable may indicate ambiguous extraction instructions. For categorical screening decisions (include/exclude/uncertain), use our Cohen's kappa calculator instead, as kappa is designed for nominal classifications. When extracting numerical study data, structure your form using our data extraction template builder to standardize variable definitions. For planning how many subjects to include in a reliability study, our power analysis calculator estimates the sample size needed to detect a target ICC with adequate precision. Report your ICC alongside its 95% confidence interval in your methods section, specifying the ICC form used and the interpretation framework applied.
The ICC quantifies the degree of agreement or consistency among measurements made by multiple raters (or instruments) on the same set of subjects. Unlike Pearson correlation, which measures linear association between two variables, ICC accounts for both the correlation and the agreement between raters. ICC values typically range from 0 (no reliability) to 1 (perfect reliability), though negative estimates can occur when between-subjects variability is lower than within-subjects variability.
The choice depends on your study design and research question. Use ICC(1,1) or ICC(1,k) when each subject is rated by a different random set of raters (one-way random model). Use ICC(2,1) or ICC(2,k) when the same random raters rate all subjects and you need absolute agreement (two-way random model). Use ICC(3,1) or ICC(3,k) when the same fixed raters rate all subjects and you only need consistency (two-way mixed model). Use single-measure forms (ending in ,1) when reporting the reliability of an individual rater, and average-measure forms (ending in ,k) when the final measurement will be the mean of k raters.
ICC is designed for continuous or ordinal data with multiple raters, while Cohen’s kappa is designed for categorical (nominal) data between exactly two raters. ICC uses an ANOVA-based framework to partition variance, while kappa uses observed vs. expected agreement proportions. For ordinal data, weighted kappa and ICC may give similar results, but ICC is generally preferred when ratings are on a continuous or interval scale. For multi-rater categorical data, Fleiss’ kappa extends Cohen’s kappa.
Koo and Li (2016) proposed widely cited guidelines: ICC < 0.50 indicates poor reliability, 0.50–0.75 indicates moderate reliability, 0.75–0.90 indicates good reliability, and > 0.90 indicates excellent reliability. However, these are general benchmarks. The acceptable ICC depends on your field and purpose. Diagnostic instruments typically require ICC > 0.90, while screening tools may accept ICC > 0.70. Always report the 95% confidence interval alongside the point estimate.
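The Koo & Li benchmarks reduce to a simple lookup. A minimal sketch (boundary values are assigned to the higher category here, a convention Koo and Li leave ambiguous):

```python
def interpret_icc(icc):
    """Koo & Li (2016) benchmark label for an ICC point estimate."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(interpret_icc(0.82))  # good
```

Remember that the label applies to the point estimate only; inspect the confidence interval before drawing conclusions.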
As a general guideline, at least 30 subjects and 3 raters are recommended for stable ICC estimates, though more is always better. With fewer than 10 subjects, ICC confidence intervals become very wide and the estimate is unreliable. The number of raters affects precision: more raters reduce the width of the confidence interval. For pilot studies, 15–20 subjects with 2–3 raters may suffice, but for definitive reliability studies, aim for 50+ subjects. Walter, Eliasziw, and Donner (1998) provide formal sample size formulas for ICC studies.
Cronbach’s alpha measures internal consistency (how well items on a scale measure the same construct), while ICC measures agreement between different raters or instruments measuring the same subjects. ICC(3,k) for consistency is mathematically identical to Cronbach’s alpha, but the interpretations differ: alpha assesses scale reliability, ICC assesses measurement reliability.
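The identity can be verified numerically. This sketch (made-up ratings for 5 subjects and 2 raters) computes Cronbach's alpha directly, then ICC(3,k) from the ANOVA mean squares, and the two agree:

```python
# Numeric check: Cronbach's alpha equals the consistency form ICC(3,k).
data = [[1, 2], [2, 4], [3, 3], [4, 6], [5, 7]]   # hypothetical ratings
n, k = len(data), len(data[0])

def var(xs):                       # sample variance (ddof = 1)
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Cronbach's alpha, treating raters as "items"
totals = [sum(row) for row in data]
alpha = k / (k - 1) * (1 - sum(var(col) for col in zip(*data)) / var(totals))

# ICC(3,k) via the two-way ANOVA mean squares
grand = sum(totals) / (n * k)
ss_rows = k * sum((sum(row) / k - grand) ** 2 for row in data)
ss_cols = n * sum((sum(c) / n - grand) ** 2 for c in zip(*data))
ss_total = sum((x - grand) ** 2 for row in data for x in row)
msb = ss_rows / (n - 1)
mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
icc3k = (msb - mse) / msb

print(alpha, icc3k)   # both ~ 0.9375
```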
For most inter-rater reliability studies, use ICC(2,1) if raters are a random sample from a larger population and you need absolute agreement, or ICC(3,1) if the same fixed raters rate all subjects and you only need consistency. Use the averaged forms (ICC(2,k) or ICC(3,k)) only when the final measurement will be the mean of k raters.
Report the ICC form used (e.g., ICC(2,1), two-way random, absolute agreement), the number of raters and subjects, the point estimate, and the 95% confidence interval. Example: “Inter-rater reliability was good, ICC(2,1) = 0.82, 95% CI [0.71, 0.89], based on 3 raters and 40 subjects.” Follow the GRRAS guideline (Kottner et al., 2011) for complete reporting.
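For scripted reports, a trivial helper (illustrative only; names and argument order are our own) can assemble the sentence above from its pieces:

```python
def icc_report(form, design, label, est, lo, hi, raters, subjects):
    """Format an ICC result sentence in the style recommended above."""
    return (f"Inter-rater reliability was {label}, {form} = {est:.2f}, "
            f"95% CI [{lo:.2f}, {hi:.2f}] ({design}), "
            f"based on {raters} raters and {subjects} subjects.")

print(icc_report("ICC(2,1)", "two-way random, absolute agreement",
                 "good", 0.82, 0.71, 0.89, 3, 40))
```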
Our biostatisticians can conduct full inter-rater reliability analyses, design calibration exercises, compute weighted kappa and ICC for your screening team, and ensure your systematic review meets PRISMA reporting standards.