The intraclass correlation coefficient (ICC) measures the proportion of total variance in a measurement that is attributable to differences between subjects; it is used to quantify agreement or reliability among raters, or across repeated measurements, on a continuous outcome. ICC is the right statistic when two or more raters score the same subjects on a continuous scale, or when one rater scores the same subjects more than once, and the question is how much of the variance reflects real differences between subjects versus measurement noise. Unlike Pearson correlation, which captures linear association regardless of agreement, the intraclass correlation coefficient distinguishes systematic rater bias from random error and produces a single coefficient in which 0 means no reliability and 1 means perfect reliability.
The coefficient was introduced by Ronald Fisher in 1925 and formalized for inter-rater reliability in Patrick Shrout and Joseph Fleiss's 1979 Psychological Bulletin paper, which defined the six ICC models that remain canonical. Terry Koo and Mae Li's 2016 Journal of Chiropractic Medicine paper, "A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research", provides the modern selection framework most clinical researchers use.
The Six Shrout and Fleiss Models and How to Pick the Right One
Shrout and Fleiss 1979 defined six intraclass correlation coefficient models indexed by two numbers: the model number (1, 2, or 3) and the unit indicator (1 for single-measure, k for average-measure). The six combinations are ICC(1,1), ICC(1,k), ICC(2,1), ICC(2,k), ICC(3,1), ICC(3,k).
ICC(1,1) is the one-way random-effects single-measure coefficient. Each subject is assessed by a different set of randomly selected raters; ratings are nested within subjects. This is the right model when raters are not crossed with subjects (for example, when each patient is seen by a different doctor pulled at random from a pool).
ICC(1,k) is the one-way random-effects average-measure coefficient. Same design as ICC(1,1) but the unit being reported is the average of k ratings rather than a single rating. Use this when the clinical decision rule is based on averaging multiple raters' scores.
ICC(2,1) is the two-way random-effects single-measure coefficient. Each subject is assessed by the same set of raters who were randomly selected from a larger pool. Raters are crossed with subjects; both are random factors. Use this when the raters in the study are a sample of a broader population of raters to whom the result will generalize.
ICC(2,k) is the two-way random-effects average-measure coefficient. Same design as ICC(2,1) but reports the average-of-k coefficient.
ICC(3,1) is the two-way mixed-effects single-measure coefficient. Each subject is assessed by the same set of raters, but the raters are not a random sample; they are the only raters of interest and inferences will not generalize beyond them. Raters are a fixed factor; subjects are random.
ICC(3,k) is the two-way mixed-effects average-measure coefficient. Same as ICC(3,1) but reports the average-of-k coefficient.
The choice between models 1, 2, and 3 hinges on the rater-selection question. The choice between single-measure and average-measure depends on the unit of reporting. These two choices, combined, determine the coefficient.
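In practice the six coefficients are usually computed side by side and the one matching the chosen model is reported. The sketch below assumes the Python pingouin package and its intraclass_corr function; the long-format data frame, its column names, and the scores are illustrative, and any routine that reports the Shrout and Fleiss forms would do.

```python
# A minimal sketch: all six Shrout-Fleiss coefficients at once (assumes pingouin is installed).
import pandas as pd
import pingouin as pg

# Long format: one row per (subject, rater) pair; values are made up.
ratings = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":   ["A", "B", "C"] * 4,
    "score":   [9, 2, 5, 6, 1, 3, 8, 4, 6, 7, 1, 2],
})

icc = pg.intraclass_corr(data=ratings, targets="subject",
                         raters="rater", ratings="score")
# The returned table lists ICC1, ICC2, ICC3 and their average-measure
# counterparts ICC1k, ICC2k, ICC3k, each with an F test and a 95% CI.
print(icc[["Type", "ICC", "CI95%"]])
```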
The Mathematical Definition: Variance Components
ICC is defined as a ratio of variance components. For the simplest one-way random-effects model:
ICC(1,1) = sigma^2_between / (sigma^2_between + sigma^2_within)
Where sigma^2_between is the variance attributable to systematic differences between subjects and sigma^2_within is the variance attributable to measurement error within subjects (across the multiple raters or repeated measurements). When all variance is between subjects (perfect reliability), ICC = 1. When all variance is within subjects (no reliability), ICC = 0.
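As a concrete sketch of that ratio, the two variance components can be estimated from the one-way ANOVA mean squares by the method of moments; the data below (4 subjects, each rated 3 times) are made up, and this is one standard estimator rather than the only one.

```python
# A minimal sketch of ICC(1,1) estimated from the one-way ANOVA mean squares.
import numpy as np

# Rows are subjects, columns are the k ratings nested within each subject (illustrative data).
x = np.array([[9.0, 8.0, 9.5],
              [6.0, 6.5, 5.5],
              [8.0, 7.0, 7.5],
              [3.0, 4.0, 3.5]])
n, k = x.shape

subject_means = x.mean(axis=1)
grand_mean = x.mean()

# One-way ANOVA mean squares.
ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
ms_within = np.sum((x - subject_means[:, None]) ** 2) / (n * (k - 1))

# Method-of-moments variance components:
#   sigma^2_within  ~ MS_within
#   sigma^2_between ~ (MS_between - MS_within) / k
sigma2_within = ms_within
sigma2_between = (ms_between - ms_within) / k

icc_1_1 = sigma2_between / (sigma2_between + sigma2_within)
# Equivalent mean-square form: (MSB - MSW) / (MSB + (k - 1) * MSW)
print(round(icc_1_1, 3))
```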
For the two-way models (2 and 3), the variance decomposition becomes more complex because rater variance is separated from residual error. The two-way random-effects ICC(2,1) is:
ICC(2,1) = sigma^2_subject / (sigma^2_subject + sigma^2_rater + sigma^2_error)
The two-way mixed-effects ICC(3,1) drops sigma^2_rater from the denominator because raters are treated as fixed:
ICC(3,1) = sigma^2_subject / (sigma^2_subject + sigma^2_error)
This last form is closely related to the older Pearson product-moment correlation: when subjects are crossed with two raters and the raters' systematic difference is ignored, ICC(3,1) reduces to a Pearson-like ratio and equals Pearson r exactly when the two raters' variances are equal.
The variance components themselves are estimated from a one-way or two-way analysis of variance applied to the rater-by-subject matrix. The analysis-of-variance variance-components walkthrough covers how the mean squares decompose; ICC simply uses those mean squares in specific ratios that depend on the model.
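A minimal sketch of those ratios for the two single-measure two-way forms, using the Shrout and Fleiss mean-square formulas; the subject-by-rater matrix is illustrative.

```python
# A minimal sketch: ICC(2,1) and ICC(3,1) from the two-way ANOVA mean squares.
import numpy as np

# Illustrative data: 4 subjects (rows) scored by the same 3 raters (columns).
x = np.array([[9.0, 2.0, 5.0],
              [6.0, 1.0, 3.0],
              [8.0, 4.0, 6.0],
              [7.0, 1.0, 2.0]])
n, k = x.shape

grand = x.mean()
row_means = x.mean(axis=1)   # subject means
col_means = x.mean(axis=0)   # rater means

# Two-way ANOVA mean squares (one rating per cell, interaction folded into error).
ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)       # subjects
ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)       # raters
ss_err = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
ms_err = ss_err / ((n - 1) * (k - 1))

# Shrout-Fleiss single-measure forms.
icc_2_1 = (ms_rows - ms_err) / (
    ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)   # rater variance kept in denominator
icc_3_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)    # raters fixed: rater variance dropped

print(round(icc_2_1, 3), round(icc_3_1, 3))
```

With these made-up data the raters differ systematically, so ICC(2,1) comes out much lower than ICC(3,1), which is exactly the penalty the random-effects denominator imposes.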
Koo and Li's 2016 Selection Framework
Koo and Li 2016 distilled the model-selection problem into three questions.
Question 1: Is the same set of raters scoring every subject? If yes, you are in a two-way (model 2 or 3) framework. If no, you are in a one-way (model 1) framework. Most reliability studies use the same raters for every subject, putting them in model 2 or model 3.
Question 2: Are the raters in the study a representative sample of a larger rater pool, or are they the only raters of interest? If they are a sample (and the researcher wants to generalize to other raters), use model 2 (two-way random-effects). If they are the only raters of interest (the result will not generalize beyond them), use model 3 (two-way mixed-effects). In practice, most published reliability studies use model 2 because the goal is to generalize.
Question 3: What is the unit of reporting? If the clinical decision rule scores each subject with a single rater, report single-measure. If the rule averages multiple raters' scores, report average-measure (the k subscript).
Koo and Li also recommend reporting the model, the single-measure or average-measure designation, and the 95% confidence interval in the same statement, for example: "ICC(2,1) = 0.85, 95% CI 0.78 to 0.91, two-way random-effects, single-measure."
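A small, hypothetical helper along those lines; it only formats numbers that were computed elsewhere, whatever routine produced them.

```python
# A minimal sketch of the Koo-and-Li-style reporting statement.
def report_icc(model: int, average: bool, icc: float, ci_low: float, ci_high: float) -> str:
    """Format e.g. 'ICC(2,1) = 0.85, 95% CI 0.78 to 0.91, two-way random-effects, single-measure'."""
    model_names = {1: "one-way random-effects",
                   2: "two-way random-effects",
                   3: "two-way mixed-effects"}
    unit_label = "k" if average else "1"
    unit_name = "average-measure" if average else "single-measure"
    return (f"ICC({model},{unit_label}) = {icc:.2f}, "
            f"95% CI {ci_low:.2f} to {ci_high:.2f}, "
            f"{model_names[model]}, {unit_name}")

print(report_icc(model=2, average=False, icc=0.85, ci_low=0.78, ci_high=0.91))
```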
ICC Versus Pearson Correlation Versus Cohen's Kappa
The three coefficients answer different questions.
Pearson product-moment correlation measures the strength of linear association between two continuous variables. It is symmetric in the two variables. Importantly, Pearson r ignores systematic bias: if Rater A scores everyone exactly 1 point higher than Rater B, Pearson r is 1.0 even though the raters do not agree. The two-way random-effects ICC(2,1) penalizes that systematic bias, because rater variance appears in its denominator.
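A quick numeric sketch of that contrast, with a constant 1-point offset between two raters and the ICC(2,1) mean-square form from the previous section; the data are made up.

```python
# A minimal sketch: a constant 1-point offset leaves Pearson r at 1.0
# but pulls the absolute-agreement ICC(2,1) below 1.0.
import numpy as np

rater_a = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
rater_b = rater_a + 1.0             # systematic bias: B is always 1 point higher

pearson_r = np.corrcoef(rater_a, rater_b)[0, 1]   # 1.0 (up to floating point)

# ICC(2,1) from the two-way ANOVA mean squares (Shrout-Fleiss form).
x = np.column_stack([rater_a, rater_b])
n, k = x.shape
grand = x.mean()
row_means = x.mean(axis=1)
col_means = x.mean(axis=0)
ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
ms_err = (np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
          / ((n - 1) * (k - 1)))
icc_2_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

print(round(pearson_r, 3), round(icc_2_1, 3))     # Pearson r = 1.0, ICC(2,1) < 1.0
```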
Cohen's kappa measures agreement between two raters on a categorical outcome, correcting for the agreement that would arise by chance. Kappa is appropriate for nominal or ordinal categorical ratings. The Cohen's kappa for categorical agreement walkthrough drills into the categorical case. ICC is for continuous outcomes; kappa is for categorical. They are not interchangeable.
Intraclass correlation coefficient measures the proportion of total variance attributable to between-subject variability for continuous outcomes. It can accommodate two raters or many raters, and it can decompose variance to penalize systematic rater bias separately from random error.
The practical rule: continuous outcome with multiple raters means ICC. Categorical outcome with two raters means kappa. The choosing the right statistical test decision guide walks through related selection cases.