The intraclass correlation coefficient (ICC) measures the proportion of total variance in a measurement that is attributable to differences between subjects; it is used to quantify agreement or reliability among raters, or across repeated measurements, on a continuous outcome. ICC is the right statistic when two or more raters score the same subjects on a continuous scale, or when one rater scores the same subjects more than once, and the question is how much of the variance reflects real differences between subjects versus measurement noise. Unlike Pearson correlation, which captures linear association regardless of agreement, the intraclass correlation coefficient distinguishes systematic rater bias from random error and produces a single coefficient in which 0 means no reliability and 1 means perfect reliability.
The coefficient was introduced by Ronald Fisher in 1925 and formalized for inter-rater reliability in Patrick Shrout and Joseph Fleiss's 1979 Psychological Bulletin paper, which defined the six ICC models that remain canonical. Terry Koo and Mae Li's 2016 Journal of Chiropractic Medicine paper, "A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research", provides the modern selection framework most clinical researchers use.
The Six Shrout and Fleiss Models and How to Pick the Right One
Shrout and Fleiss 1979 defined six intraclass correlation coefficient models indexed by two numbers: the model number (1, 2, or 3) and the unit indicator (1 for single-measure, k for average-measure). The six combinations are ICC(1,1), ICC(1,k), ICC(2,1), ICC(2,k), ICC(3,1), ICC(3,k).
ICC(1,1) is the one-way random-effects single-measure coefficient. Each subject is assessed by a different set of randomly selected raters; ratings are nested within subjects. This is the right model when raters are not crossed with subjects (for example, when each patient is seen by a different doctor pulled at random from a pool).
ICC(1,k) is the one-way random-effects average-measure coefficient. Same design as ICC(1,1) but the unit being reported is the average of k ratings rather than a single rating. Use this when the clinical decision rule is based on averaging multiple raters' scores.
ICC(2,1) is the two-way random-effects single-measure coefficient. Each subject is assessed by the same set of raters who were randomly selected from a larger pool. Raters are crossed with subjects; both are random factors. Use this when the raters in the study are a sample of a broader population of raters to whom the result will generalize.
ICC(2,k) is the two-way random-effects average-measure coefficient. Same design as ICC(2,1) but reports the average-of-k coefficient.
ICC(3,1) is the two-way mixed-effects single-measure coefficient. Each subject is assessed by the same set of raters, but the raters are not a random sample; they are the only raters of interest and inferences will not generalize beyond them. Raters are a fixed factor; subjects are random.
ICC(3,k) is the two-way mixed-effects average-measure coefficient. Same as ICC(3,1) but reports the average-of-k coefficient.
The choice between models 1, 2, and 3 hinges on the rater-selection question. The choice between single-measure and average-measure depends on the unit of reporting. These two choices, combined, determine the coefficient.
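In practice the six coefficients are usually computed side by side and the one matching the chosen model is reported. The sketch below assumes the Python pingouin package and its intraclass_corr function; the long-format data frame, its column names, and the scores are illustrative, and any routine that reports the Shrout and Fleiss forms would do.

```python
# A minimal sketch: all six Shrout-Fleiss coefficients at once (assumes pingouin is installed).
import pandas as pd
import pingouin as pg

# Long format: one row per (subject, rater) pair; values are made up.
ratings = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":   ["A", "B", "C"] * 4,
    "score":   [9, 2, 5, 6, 1, 3, 8, 4, 6, 7, 1, 2],
})

icc = pg.intraclass_corr(data=ratings, targets="subject",
                         raters="rater", ratings="score")
# The returned table lists ICC1, ICC2, ICC3 and their average-measure
# counterparts ICC1k, ICC2k, ICC3k, each with an F test and a 95% CI.
print(icc[["Type", "ICC", "CI95%"]])
```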
The Mathematical Definition: Variance Components
ICC is defined as a ratio of variance components. For the simplest one-way random-effects model:
ICC(1,1) = sigma^2_between / (sigma^2_between + sigma^2_within)
Where sigma^2_between is the variance attributable to systematic differences between subjects and sigma^2_within is the variance attributable to measurement error within subjects (across the multiple raters or repeated measurements). When all variance is between subjects (perfect reliability), ICC = 1. When all variance is within subjects (no reliability), ICC = 0.
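As a concrete sketch of that ratio, the two variance components can be estimated from the one-way ANOVA mean squares by the method of moments; the data below (4 subjects, each rated 3 times) are made up, and this is one standard estimator rather than the only one.

```python
# A minimal sketch of ICC(1,1) estimated from the one-way ANOVA mean squares.
import numpy as np

# Rows are subjects, columns are the k ratings nested within each subject (illustrative data).
x = np.array([[9.0, 8.0, 9.5],
              [6.0, 6.5, 5.5],
              [8.0, 7.0, 7.5],
              [3.0, 4.0, 3.5]])
n, k = x.shape

subject_means = x.mean(axis=1)
grand_mean = x.mean()

# One-way ANOVA mean squares.
ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
ms_within = np.sum((x - subject_means[:, None]) ** 2) / (n * (k - 1))

# Method-of-moments variance components:
#   sigma^2_within  ~ MS_within
#   sigma^2_between ~ (MS_between - MS_within) / k
sigma2_within = ms_within
sigma2_between = (ms_between - ms_within) / k

icc_1_1 = sigma2_between / (sigma2_between + sigma2_within)
# Equivalent mean-square form: (MSB - MSW) / (MSB + (k - 1) * MSW)
print(round(icc_1_1, 3))
```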
For the two-way models (2 and 3), the variance decomposition becomes more complex because rater variance is separated from residual error. The two-way random-effects ICC(2,1) is:
ICC(2,1) = sigma^2_subject / (sigma^2_subject + sigma^2_rater + sigma^2_error)
The two-way mixed-effects ICC(3,1) drops sigma^2_rater from the denominator because raters are treated as fixed:
ICC(3,1) = sigma^2_subject / (sigma^2_subject + sigma^2_error)
This last form is closely related to the older Pearson product-moment correlation: when subjects are crossed with two raters and the raters' systematic difference is ignored, ICC(3,1) reduces to a Pearson-like ratio and equals Pearson r exactly when the two raters' variances are equal.
The variance components themselves are estimated from a one-way or two-way analysis of variance applied to the rater-by-subject matrix. The analysis-of-variance variance-components walkthrough covers how the mean squares decompose; ICC simply uses those mean squares in specific ratios that depend on the model.
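A minimal sketch of those ratios for the two single-measure two-way forms, using the Shrout and Fleiss mean-square formulas; the subject-by-rater matrix is illustrative.

```python
# A minimal sketch: ICC(2,1) and ICC(3,1) from the two-way ANOVA mean squares.
import numpy as np

# Illustrative data: 4 subjects (rows) scored by the same 3 raters (columns).
x = np.array([[9.0, 2.0, 5.0],
              [6.0, 1.0, 3.0],
              [8.0, 4.0, 6.0],
              [7.0, 1.0, 2.0]])
n, k = x.shape

grand = x.mean()
row_means = x.mean(axis=1)   # subject means
col_means = x.mean(axis=0)   # rater means

# Two-way ANOVA mean squares (one rating per cell, interaction folded into error).
ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)       # subjects
ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)       # raters
ss_err = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
ms_err = ss_err / ((n - 1) * (k - 1))

# Shrout-Fleiss single-measure forms.
icc_2_1 = (ms_rows - ms_err) / (
    ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)   # rater variance kept in denominator
icc_3_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)    # raters fixed: rater variance dropped

print(round(icc_2_1, 3), round(icc_3_1, 3))
```

With these made-up data the raters differ systematically, so ICC(2,1) comes out much lower than ICC(3,1), which is exactly the penalty the random-effects denominator imposes.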
Koo and Li's 2016 Selection Framework
Koo and Li 2016 distilled the model-selection problem into three questions.
Question 1: Is the same set of raters scoring every subject? If yes, you are in a two-way (model 2 or 3) framework. If no, you are in a one-way (model 1) framework. Most reliability studies use the same raters for every subject, putting them in model 2 or model 3.
Question 2: Are the raters in the study a representative sample of a larger rater pool, or are they the only raters of interest? If they are a sample (and the researcher wants to generalize to other raters), use model 2 (two-way random-effects). If they are the only raters of interest (the result will not generalize beyond them), use model 3 (two-way mixed-effects). In practice, most published reliability studies use model 2 because the goal is to generalize.
Question 3: What is the unit of reporting? If the clinical decision rule scores each subject with a single rater, report single-measure. If the rule averages multiple raters' scores, report average-measure (the k subscript).
Koo and Li also recommend reporting the model, the single-measure or average-measure designation, and the 95% confidence interval in the same statement, for example: "ICC(2,1) = 0.85, 95% CI 0.78 to 0.91, two-way random-effects, single-measure."
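A small, hypothetical helper along those lines; it only formats numbers that were computed elsewhere, whatever routine produced them.

```python
# A minimal sketch of the Koo-and-Li-style reporting statement.
def report_icc(model: int, average: bool, icc: float, ci_low: float, ci_high: float) -> str:
    """Format e.g. 'ICC(2,1) = 0.85, 95% CI 0.78 to 0.91, two-way random-effects, single-measure'."""
    model_names = {1: "one-way random-effects",
                   2: "two-way random-effects",
                   3: "two-way mixed-effects"}
    unit_label = "k" if average else "1"
    unit_name = "average-measure" if average else "single-measure"
    return (f"ICC({model},{unit_label}) = {icc:.2f}, "
            f"95% CI {ci_low:.2f} to {ci_high:.2f}, "
            f"{model_names[model]}, {unit_name}")

print(report_icc(model=2, average=False, icc=0.85, ci_low=0.78, ci_high=0.91))
```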
ICC Versus Pearson Correlation Versus Cohen's Kappa
The three coefficients answer different questions.
Pearson product-moment correlation measures the strength of linear association between two continuous variables. It is symmetric in the two variables. Importantly, Pearson r ignores systematic bias: if Rater A scores everyone exactly 1 point higher than Rater B, Pearson r is 1.0 even though the raters do not agree. The two-way random-effects ICC(2,1) penalizes that systematic bias, because rater variance appears in its denominator.
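A quick numeric sketch of that contrast, with a constant 1-point offset between two raters and the ICC(2,1) mean-square form from the previous section; the data are made up.

```python
# A minimal sketch: a constant 1-point offset leaves Pearson r at 1.0
# but pulls the absolute-agreement ICC(2,1) below 1.0.
import numpy as np

rater_a = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
rater_b = rater_a + 1.0             # systematic bias: B is always 1 point higher

pearson_r = np.corrcoef(rater_a, rater_b)[0, 1]   # 1.0 (up to floating point)

# ICC(2,1) from the two-way ANOVA mean squares (Shrout-Fleiss form).
x = np.column_stack([rater_a, rater_b])
n, k = x.shape
grand = x.mean()
row_means = x.mean(axis=1)
col_means = x.mean(axis=0)
ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
ms_err = (np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
          / ((n - 1) * (k - 1)))
icc_2_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

print(round(pearson_r, 3), round(icc_2_1, 3))     # Pearson r = 1.0, ICC(2,1) < 1.0
```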
Cohen's kappa measures agreement between two raters on a categorical outcome, correcting for the agreement that would arise by chance. Kappa is appropriate for nominal or ordinal categorical ratings. The Cohen's kappa for categorical agreement walkthrough drills into the categorical case. ICC is for continuous outcomes; kappa is for categorical. They are not interchangeable.
Intraclass correlation coefficient measures the proportion of total variance attributable to between-subject variability for continuous outcomes. It can accommodate two raters or many raters, and it can decompose variance to penalize systematic rater bias separately from random error.
The practical rule: continuous outcome with multiple raters means ICC. Categorical outcome with two raters means kappa. The choosing the right statistical test decision guide walks through related selection cases.