Fit a three-level random-effects meta-analysis for dependent effect sizes nested within studies or clusters. The model estimates a between-cluster and a within-cluster variance component by restricted maximum likelihood, reports the intraclass correlation, and gives you both a model-based and a cluster-robust confidence interval for the pooled effect. Add a moderator for three-level meta-regression, then export ready-to-run metafor rma.mv and clubSandwich R code along with a publication-ready methods paragraph.
One effect per line, comma or tab separated: Study label, Cluster id, Effect size, Variance. Effects sharing a cluster id are treated as statistically dependent (level 2 within level 3).
Upload a file: CSV, TSV, or Excel (.xlsx/.xls), max 2000 rows.
13 effects parsed across 5 clusters.
| Coefficient | Estimate | Model SE | Model 95% CI | Robust SE (CR2) | Robust 95% CI |
|---|---|---|---|---|---|
| Pooled effect | 0.3125 | 0.0722 | 0.171 to 0.454 | 0.0729 | 0.104 to 0.521 (df 3.7) |
Squares are individual effects (area proportional to inverse-variance weight) with 95% confidence intervals, grouped by cluster. The green diamond is the pooled estimate and the dashed teal bar is the 95% prediction interval.
```r
library(metafor)
library(clubSandwich)

yi <- c(0.3, 0.42, 0.38, 0.1, 0.05, 0.18, 0.12, 0.55, 0.49, 0.22, 0.34, 0.41, 0.37)
vi <- c(0.04, 0.05, 0.03, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05)
cluster <- c("District 1", "District 1", "District 1", "District 2", "District 2", "District 2", "District 2", "District 3", "District 3", "District 4", "District 4", "District 5", "District 5")
effect <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
dat <- data.frame(yi, vi, cluster, effect)

# three-level random-effects model (effects nested within clusters)
res <- rma.mv(yi, vi, random = ~ 1 | cluster/effect,
              data = dat, method = "REML")
summary(res)

# profile-likelihood confidence intervals for the variance components
confint(res)

# cluster-robust (RVE) inference, small-sample CR2 + Satterthwaite df
robust(res, cluster = dat$cluster, clubSandwich = TRUE)
```

A three-level random-effects meta-analysis was fitted by restricted maximum likelihood (REML) to account for the dependence among 13 effect sizes nested within 5 clusters. The model partitioned heterogeneity into a between-cluster component (sigma-squared level 3 = 0.0071) and a within-cluster between-effect component (sigma-squared level 2 = 0.0000), giving a total tau-squared of 0.0071 and an intraclass correlation of 1.000. Profile-likelihood 95% confidence intervals were 0.0000 to 0.1311 for the level-3 variance and 0.0000 to 0.0295 for the level-2 variance. A likelihood-ratio test indicated that the between-cluster (level-3) variance did not significantly improve model fit over a two-level model (LRT = 0.29, p = 0.2946, boundary-corrected). The pooled estimate was 0.312 (model-based 95% CI 0.171 to 0.454). Because the number of clusters was modest, inference was also based on cluster-robust standard errors using the CR2 bias-reduced estimator with Satterthwaite degrees of freedom, yielding a 95% CI of 0.104 to 0.521 on 3.7 degrees of freedom. The 95% prediction interval for the true effect in a new study was 0.095 to 0.530. Total heterogeneity across all effects was Q(12) = 5.79 (p = 0.9261), I-squared = 0.0%. The accompanying R code reproduces this analysis with metafor and clubSandwich.
Enter one effect size per line as study label, cluster id, effect size, variance. Effects sharing a cluster id are treated as statistically dependent at level 2 within their level-3 cluster.
Tick the standard error box if your fourth column holds standard errors rather than variances. Tick the moderator box and add one or more moderator columns from the fifth column onward to run a three-level meta-regression. Choose 90%, 95%, or 99% confidence.
Inspect the between-cluster (level 3) and within-cluster (level 2) variance components and the intraclass correlation to judge how much heterogeneity sits between clusters versus within them.
Compare the model-based confidence interval, which trusts the REML variance components, with the cluster-robust interval, which is more conservative. With few clusters, trust the cluster-robust interval.
Review Cochran's Q and I-squared for total heterogeneity across all effects, then use the variance split to see whether that heterogeneity is driven by between-cluster or within-cluster differences.
Copy the reproducible metafor rma.mv and clubSandwich R code, and copy the auto-generated methods paragraph that follows standard reporting conventions, directly into your manuscript.
Standard two-level meta-analysis assumes every effect size is independent. When several effects come from the same study, sample, or laboratory, that assumption fails. Fitting a two-level model to dependent data understates the standard error of the pooled estimate and inflates the false positive rate. The three-level model adds a clustering level that restores honest inference.
Instead of a single tau-squared, the three-level model estimates a level-3 variance for differences between clusters and a level-2 variance for differences between effects within the same cluster. Reporting both shows reviewers exactly where the heterogeneity lives, which a single pooled tau-squared cannot reveal.
The intraclass correlation is the share of total heterogeneity sitting at the cluster level. A high value means effects within a cluster are strongly correlated and the clustering matters a great deal. A value near zero means the clustering adds little and a simpler model would give similar results.
Model-based intervals trust the estimated variance components completely. Cluster-robust (sandwich) intervals build the standard error from the observed within-cluster residuals, so they stay valid even when the assumed dependence structure is not exactly right. With a modest number of clusters, the cluster-robust interval is the safer report.
The reliability of a three-level model depends on the number of clusters, not the number of effects. With fewer than about ten clusters, the naive robust standard error is biased downward. The exported clubSandwich code applies the CR2 bias correction and Satterthwaite degrees of freedom, keeping the false positive rate close to nominal even with five or six clusters.
When each cluster contains a single effect size, the level-2 and level-3 variances cannot be separated (only their sum is identified) and the model collapses exactly to a standard two-level random-effects meta-analysis. This means you can adopt the three-level model as a safe default for any dataset with potential dependence, losing nothing when the dependence is absent.
The classical random-effects meta-analysis assumes that the effect sizes entering the pooled estimate are mutually independent, each one drawn from a different study contributing a single result. Real systematic reviews rarely satisfy this assumption. A single primary study often reports several outcome measures, several follow-up time points, several intervention arms compared against a shared control, or several subgroups, and each of those produces an effect size. Effects extracted from the same study, the same research group, the same laboratory, or even the same country tend to be more alike than effects drawn at random from the literature. When this dependence is ignored, the pooled standard error is biased downward, confidence intervals are too narrow, and the probability of a false positive rises well above the nominal level. The three-level meta-analysis formalized by Konstantopoulos (2011), Van den Noortgate and colleagues (2013), and Cheung (2014) addresses this problem directly by adding a clustering level to the random-effects structure.
In a multilevel meta-analysis, variability is partitioned across three levels. Level one is the sampling variance of each observed effect size, which is treated as known and supplied as the variance or squared standard error of each estimate. Level two captures the variation between effect sizes that belong to the same cluster, for example the differences among the several outcomes reported by one study. Level three captures the variation between clusters, for example the differences between studies. The model therefore estimates two variance components instead of one: a within-cluster between-effect variance and a between-cluster variance. This tool estimates both by restricted maximum likelihood (REML), the same estimator the metafor package uses by default for its rma.mv function, because REML corrects for the degrees of freedom consumed by the fixed effects and reduces the downward bias that maximum likelihood shows when the number of clusters is small.
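Written out, the three-level structure described above can be sketched as follows. The notation is chosen here for illustration and is not taken from the tool: $y_{ij}$ is effect $i$ in cluster $j$, $\mu$ is the pooled effect, $u_j$ and $w_{ij}$ are the level-3 and level-2 random effects, and $s^2_{ij}$ is the known sampling variance.

```latex
y_{ij} = \mu + u_j + w_{ij} + e_{ij}, \qquad
u_j \sim N\!\left(0,\ \sigma^2_{(3)}\right), \quad
w_{ij} \sim N\!\left(0,\ \sigma^2_{(2)}\right), \quad
e_{ij} \sim N\!\left(0,\ s^2_{ij}\right)

\operatorname{Var}(y_{ij}) = \sigma^2_{(3)} + \sigma^2_{(2)} + s^2_{ij}, \qquad
\operatorname{Cov}(y_{ij},\, y_{i'j}) = \sigma^2_{(3)} \quad (i \neq i')
```

Under this sketch, two effects from the same cluster share only the level-3 term, which is why the between-cluster variance is what induces the within-cluster correlation. The covariance line assumes independent sampling errors within a cluster.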
The practical payoff is an honest standard error. Because the model knows that effects from the same cluster are correlated, it does not treat each dependent effect as if it were a fresh independent data point. The intraclass correlation reported alongside the variance components tells you how strong that within-cluster correlation is, expressed as the proportion of total heterogeneity attributable to the cluster level. When the intraclass correlation is high, the clustering structure is doing substantial work and a naive two-level analysis would have badly understated the uncertainty. When it is near zero, the effects within a cluster are about as different as effects across clusters, the clustering adds little, and the three-level estimate converges on the two-level one. This calculator has been verified to reproduce the standard two-level REML estimate exactly in the limiting case where every cluster contains a single effect size, confirming that the multilevel engine is a faithful generalization rather than an approximation.
Even a correctly specified three-level model rests on the assumption that the working covariance structure is right, and that assumption is rarely testable. Cluster-robust variance estimation, also called robust variance estimation or the sandwich estimator, provides a safeguard. Rather than trusting the estimated variance components completely, it constructs the standard error of the pooled estimate from the residuals observed within each cluster, so the resulting interval remains valid even when the dependence has been modeled imperfectly. This tool reports both intervals side by side: a model-based interval that trusts the REML variance components, and a cluster-robust interval built from the within-cluster residuals. When the number of clusters is modest, the cluster-robust interval is the more trustworthy of the two. The exported R code uses the clubSandwich package to add the CR2 bias-reduced estimator and Satterthwaite degrees of freedom, the small-sample correction recommended by Tipton (2015) and Pustejovsky and Tipton (2018) when there are few clusters, which keeps the error rate close to the nominal level even with as few as five or six clusters.
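To make the side-by-side comparison concrete, the following sketch rebuilds both intervals from the numbers in the results table above, using base R only. It assumes the model-based interval uses a normal quantile and the robust interval uses a t quantile on the reported Satterthwaite degrees of freedom; because the inputs are rounded, the robust bounds may differ from the table in the third decimal.

```r
# Values copied from the results table above (rounded as displayed)
estimate  <- 0.3125
model_se  <- 0.0722
robust_se <- 0.0729   # CR2 standard error
robust_df <- 3.7      # Satterthwaite degrees of freedom
level     <- 0.95

alpha <- 1 - level

# Model-based interval: normal quantile times the model standard error
model_ci <- estimate + c(-1, 1) * qnorm(1 - alpha / 2) * model_se

# Cluster-robust interval: t quantile on the small-sample degrees of freedom
robust_ci <- estimate + c(-1, 1) * qt(1 - alpha / 2, df = robust_df) * robust_se

round(model_ci, 3)
round(robust_ci, 3)
```

The two standard errors are nearly identical here. The robust interval is wider mainly because the t critical value on 3.7 degrees of freedom is well above the normal value of 1.96.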
The three-level structure also extends naturally to meta-regression. By adding a numeric moderator, this tool fits a three-level mixed-effects model that estimates an intercept and a slope while still partitioning the residual heterogeneity into level-two and level-three components. This lets you ask whether a study-level or effect-level covariate, such as dose, year, mean age, or risk-of-bias score, explains part of the heterogeneity, all while correctly accounting for the dependence among effects. The moderator slope is reported with both a model-based and a cluster-robust confidence interval, so you can judge its significance under both inferential approaches. After estimating the model here, you can deepen the analysis with our meta-regression data formatter or visualize the relationship with our bubble plot generator.
Multilevel meta-analysis fits naturally into a broader evidence-synthesis workflow. Begin by standardizing every effect onto a common scale with our effect size calculator, since mixing measures or units invalidates the pooled estimate just as it would in a two-level model. Once you have fitted the three-level model here, display the pooled result and the contributing effects with our forest plot generator, and assess publication bias with our funnel plot and Egger's test tool, keeping in mind that bias diagnostics designed for independent effects should be interpreted cautiously when effects are clustered. For reviews comparing several interventions at once, our network meta-analysis helper handles indirect comparisons and treatment ranking. Reporting the variance components, the intraclass correlation, and the cluster-robust interval together gives readers a transparent account of how dependence was handled, which is increasingly expected by methodological reviewers and consistent with the guidance in the Cochrane Handbook on units-of-analysis issues.
A three-level meta-analysis is a random-effects model that accounts for statistical dependence among effect sizes that share a common source. The three levels are: sampling variance of each observed effect (level 1), variation between effect sizes within the same cluster such as a study, sample, or laboratory (level 2), and variation between clusters (level 3). Standard two-level meta-analysis assumes every effect size is independent, which is violated when a single study reports several outcomes, several time points, or several subgroups. Ignoring that dependence understates the standard error of the pooled estimate and produces confidence intervals that are too narrow. The three-level model estimates a separate variance component at level 2 and level 3, so the pooled effect carries an honest standard error. This calculator fits the model by restricted maximum likelihood, which is the estimator the metafor package uses by default for rma.mv.
Use a multilevel model whenever the effect sizes in your dataset are not mutually independent. The most common situation is multiple effect sizes extracted from the same study, for example several outcome measures, several follow-up times, or several intervention arms compared against a shared control. Other situations include effect sizes from the same research group, the same laboratory, the same country, or the same patient sample. If you fit a standard two-level model to dependent data, the pooled standard error is biased downward and the false positive rate rises. The three-level model addresses this by adding a clustering level. If the level-3 variance turns out to be near zero, the model collapses to a standard random-effects meta-analysis, so you lose nothing by checking.
The intraclass correlation (ICC) reported by this tool is the proportion of total heterogeneity that sits at the cluster level (level 3) rather than within clusters (level 2). It is computed as the level-3 variance divided by the sum of the level-3 and level-2 variances. An ICC near 1 means almost all of the between-effect variability is explained by differences between clusters, so effects from the same cluster are highly correlated. An ICC near 0 means effects within a cluster are about as different from each other as effects from different clusters, which suggests the clustering structure adds little and a simpler model may suffice. The ICC helps you judge how much the dependence structure matters for your specific dataset.
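The ratio described above is simple enough to check by hand. A minimal sketch in base R, using the variance components from the example methods paragraph; the function name `icc` and the handling of the zero-heterogeneity case are assumptions made here, not the tool's documented behavior.

```r
# ICC = level-3 variance / (level-3 variance + level-2 variance)
icc <- function(sigma2_level3, sigma2_level2) {
  total <- sigma2_level3 + sigma2_level2
  if (total == 0) return(NA_real_)  # no heterogeneity at either level: ratio undefined
  sigma2_level3 / total
}

icc(0.0071, 0.0000)  # example dataset: all heterogeneity sits between clusters
icc(0.02, 0.02)      # equal split between the two levels
```

In the example dataset the level-2 variance is estimated at zero, so the ICC is 1 even though the level-3 variance itself is small. The ICC describes the split of heterogeneity, not its size.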
Cluster-robust variance estimation, also called robust variance estimation (RVE) or the sandwich estimator, produces standard errors that remain valid even when the working model for the dependence is not exactly correct. Instead of trusting the estimated variance components completely, it builds the standard error from the observed residuals within each cluster. This tool reports both a model-based confidence interval (which trusts the REML variance components) and a cluster-robust confidence interval (which is more conservative and robust to misspecification). The robust interval is computed in your browser using the CR2 small-sample bias correction with Bell-McCaffrey Satterthwaite degrees of freedom, the same method that the clubSandwich R package applies, rather than the naive sandwich estimator that under-covers with few clusters. When the number of clusters is modest, the cluster-robust interval is the safer choice. For very large datasets the tool falls back to the basic sandwich estimator for speed and tells you when it does, while the exported clubSandwich R code always applies the full CR2 correction.
The number of clusters, not the number of effect sizes, drives the reliability of the variance components and the cluster-robust standard errors. With fewer than about ten clusters, the level-3 variance is estimated imprecisely and the naive cluster-robust standard error is biased downward. This is why small-sample corrections matter: the CR2 estimator with Satterthwaite degrees of freedom, available in the exported clubSandwich R code, keeps the false positive rate close to the nominal level even with as few as five or six clusters. This calculator computes the model-based interval using a normal approximation and the cluster-robust interval using a t-distribution. In the example output above, the robust interval is reported on 3.7 degrees of freedom for 5 clusters, which reflects the Satterthwaite approximation rather than the simpler count of clusters minus model coefficients (4 here). For formal small-sample inference, run the exported R code.
Yes, and you are not limited to a single covariate. Enable the moderator option and add as many columns as you like from the fifth column onward. Each column is detected automatically: a column of numbers is entered as a continuous predictor, while a column of text labels is dummy-coded, with the first level it encounters used as the reference category. The tool then fits a three-level mixed-effects meta-regression, estimating an intercept and a coefficient for every predictor while still partitioning the residual heterogeneity into level-2 and level-3 variance components. The coefficient table reports each slope with both a model-based and a cluster-robust confidence interval, and an omnibus Wald test (Q_M) reports whether the moderators jointly explain heterogeneity. A header row lets you name each coefficient. The exported R code adds the moderators to the rma.mv call via the mods argument, wrapping categorical columns in factor(), so you can reproduce and extend the analysis in metafor.
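As an illustration of what such a call can look like in metafor, here is a minimal sketch that adds one numeric and one categorical moderator to the example data. The `dose` and `setting` columns and all of their values are hypothetical, invented purely for illustration; only `yi`, `vi`, and the cluster structure come from the example above. The exact script the tool exports may differ.

```r
library(metafor)

dat <- data.frame(
  yi = c(0.3, 0.42, 0.38, 0.1, 0.05, 0.18, 0.12, 0.55, 0.49, 0.22, 0.34, 0.41, 0.37),
  vi = c(0.04, 0.05, 0.03, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05),
  cluster = rep(paste("District", 1:5), times = c(3, 4, 2, 2, 2)),
  effect = 1:13,
  # HYPOTHETICAL moderators, made up for this sketch
  dose = c(10, 20, 15, 5, 10, 15, 5, 25, 20, 10, 15, 20, 15),
  setting = c("urban", "rural", "urban", "rural", "urban", "urban", "rural",
              "urban", "rural", "rural", "urban", "urban", "rural")
)

# R orders factor levels alphabetically by default; fixing the levels to the
# order of first appearance makes the first label the reference category
dat$setting <- factor(dat$setting, levels = unique(dat$setting))

# three-level meta-regression: intercept, one slope, one dummy-coded contrast
res_mod <- rma.mv(yi, vi, mods = ~ dose + setting,
                  random = ~ 1 | cluster/effect,
                  data = dat, method = "REML")

res_mod        # coefficient table plus the omnibus test of moderators (QM)
coef(res_mod)  # intercept, dose slope, and the setting contrast
```

The same `robust()` call used for the intercept-only model can be applied to `res_mod` to obtain cluster-robust intervals for each coefficient.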
The confidence interval describes the uncertainty in the average effect: it is the range that is likely to contain the true mean effect across all studies. The prediction interval is wider and answers a different question: where is the true effect of a new, as-yet-unconducted study likely to fall? It combines the uncertainty in the mean with the total heterogeneity variance (level 2 plus level 3), so it reflects how much real effects vary from context to context rather than just sampling noise. This tool reports the prediction interval the same way metafor's predict() function does for rma.mv models, using the normal quantile and the total heterogeneity variance, so the bounds match metafor exactly. When you have substantial heterogeneity, the prediction interval is often the most honest summary to report, because two studies can both be consistent with the pooled estimate yet have very different true effects. When you fit moderators, the interval is reported at the reference covariate values.
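The calculation described above can be reproduced from the reported summary values. A minimal base R sketch, assuming the interval is the pooled estimate plus or minus a normal quantile times the square root of the squared standard error plus the total heterogeneity variance; the inputs are the rounded values from the example output.

```r
# Rounded values from the example output above
estimate      <- 0.3125
se            <- 0.0722
sigma2_level3 <- 0.0071
sigma2_level2 <- 0.0000
level         <- 0.95

z <- qnorm(1 - (1 - level) / 2)

# Uncertainty in the mean plus the total heterogeneity variance
total_heterogeneity <- sigma2_level3 + sigma2_level2
pi_se <- sqrt(se^2 + total_heterogeneity)

prediction_interval <- estimate + c(-1, 1) * z * pi_se
round(prediction_interval, 3)  # about 0.095 and 0.530, as in the methods paragraph
```

Because the heterogeneity variance is added inside the square root, the prediction interval cannot be narrower than the model-based confidence interval built from the same standard error.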
When a single study reports several effect sizes from overlapping samples, the sampling errors of those effects are correlated, not just the underlying true effects. The within-cluster sampling correlation control lets you set that correlation. Leaving it at zero treats the sampling errors as independent, which is the classic three-level model. Raising it, commonly to a value between 0.6 and 0.8 when the exact correlations are unknown, fits the correlated-and-hierarchical-effects (CHE) working model of Pustejovsky and Tipton (2022), which combines a constant within-cluster sampling correlation with the random-effects clustering structure. The CHE model paired with cluster-robust variance estimation is currently the recommended default for dependent effect sizes, because the robust standard errors stay valid even if the assumed correlation is not exactly right. The exported R code builds the corresponding block-diagonal covariance with metafor's vcalc function and passes it to rma.mv.
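A sketch of the CHE working model on the example data might look like the following. It assumes a constant within-cluster sampling correlation of 0.6, which is an illustrative value rather than an estimate, and the script the tool actually exports may differ in detail.

```r
library(metafor)

dat <- data.frame(
  yi = c(0.3, 0.42, 0.38, 0.1, 0.05, 0.18, 0.12, 0.55, 0.49, 0.22, 0.34, 0.41, 0.37),
  vi = c(0.04, 0.05, 0.03, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05),
  cluster = rep(paste("District", 1:5), times = c(3, 4, 2, 2, 2)),
  effect = 1:13
)

rho <- 0.6  # ASSUMED within-cluster sampling correlation

# Block-diagonal sampling covariance matrix: vi on the diagonal,
# rho * sqrt(vi * vj) for two effects in the same cluster, 0 across clusters
V <- vcalc(vi, cluster = cluster, obs = effect, rho = rho, data = dat)

# CHE working model: correlated sampling errors plus nested random effects
res_che <- rma.mv(yi, V, random = ~ 1 | cluster/effect,
                  data = dat, method = "REML")

# Cluster-robust inference guards against a wrong choice of rho
robust(res_che, cluster = dat$cluster, clubSandwich = TRUE)
```

Refitting with a few different values of `rho` is a common sensitivity check, since the true sampling correlation is usually unknown.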
Yes. The tool reports a likelihood-ratio test that compares the full three-level model against a reduced two-level model with the between-cluster (level-3) variance fixed at zero. Because the null value sits on the boundary of the parameter space, the p-value uses the boundary-corrected reference distribution, a fifty-fifty mixture of chi-square with zero and one degrees of freedom, which is the correct test for a variance component. A significant result means the clustering level genuinely improves fit and the three-level structure is justified; a non-significant result suggests a simpler two-level model may suffice. The tool also reports profile-likelihood confidence intervals for both the level-2 and level-3 variance components, the same approach as metafor's confint function. Profile-likelihood intervals are asymmetric and respect the non-negativity of variances, so they are more honest than a symmetric standard-error interval, especially when a component is estimated near zero.
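The boundary correction is a one-line adjustment. The sketch below, in base R, uses the fact that a chi-square with zero degrees of freedom is a point mass at zero, so for a positive statistic the mixture p-value is half the usual one-degree-of-freedom p-value. The function name `boundary_p` is an assumption made for this example.

```r
# p-value under a 50:50 mixture of chi-square(0) and chi-square(1)
boundary_p <- function(lrt_statistic) {
  if (lrt_statistic <= 0) return(1)  # no improvement in fit at all
  0.5 * pchisq(lrt_statistic, df = 1, lower.tail = FALSE)
}

lrt_statistic <- 0.29  # rounded value from the example methods paragraph

p_naive    <- pchisq(lrt_statistic, df = 1, lower.tail = FALSE)
p_boundary <- boundary_p(lrt_statistic)

c(naive = p_naive, boundary_corrected = p_boundary)
```

With the rounded statistic of 0.29 this gives a corrected p-value close to the 0.2946 quoted in the methods paragraph; the small difference comes from rounding the statistic.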
This calculator estimates the level-2 and level-3 variance components by restricted maximum likelihood (REML), the same default that the metafor package uses for rma.mv. REML is generally preferred over maximum likelihood because it corrects for the degrees of freedom used in estimating the fixed effects, which reduces downward bias in the variance components, especially with few clusters. The optimization uses a bounded coordinate-descent search with golden-section minimization on each component, which respects the non-negativity constraint on variances. The engine has been verified to reproduce the standard two-level REML estimate exactly when each cluster contains a single effect size, which is the limiting case a three-level model should match.
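The estimation strategy described above can be sketched in a few lines of base R for the intercept-only model. This is not the tool's implementation: it uses `optimize()`, which combines golden-section search with parabolic interpolation, as a stand-in for pure golden-section minimization, and the search bounds, starting values, and helper names are assumptions.

```r
yi <- c(0.3, 0.42, 0.38, 0.1, 0.05, 0.18, 0.12, 0.55, 0.49, 0.22, 0.34, 0.41, 0.37)
vi <- c(0.04, 0.05, 0.03, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05)
cluster <- rep(paste("District", 1:5), times = c(3, 4, 2, 2, 2))

# TRUE where two effects belong to the same cluster
same_cluster <- outer(cluster, cluster, "==")

# Marginal covariance: sampling + level-2 variance on the diagonal,
# level-3 variance shared by all effects within a cluster
marginal_V <- function(s2, s3) diag(vi + s2) + s3 * same_cluster

# Generalized least squares fit of the pooled effect (intercept only)
gls_fit <- function(s2, s3) {
  V <- marginal_V(s2, s3)
  Vinv <- solve(V)
  XtVinvX <- sum(Vinv)  # 1' V^-1 1 for an intercept-only design
  beta <- sum(Vinv %*% yi) / XtVinvX
  list(V = V, Vinv = Vinv, XtVinvX = XtVinvX, beta = beta, se = sqrt(1 / XtVinvX))
}

# REML deviance: -2 * restricted log-likelihood, additive constants dropped
reml_dev <- function(s2, s3) {
  fit <- gls_fit(s2, s3)
  r <- yi - fit$beta
  as.numeric(determinant(fit$V)$modulus) + log(fit$XtVinvX) + sum(r * (fit$Vinv %*% r))
}

# Coordinate descent: minimize over one component at a time within [0, 1]
s2 <- 0.01
s3 <- 0.01
for (iter in 1:200) {
  previous <- c(s2, s3)
  s2 <- optimize(function(v) reml_dev(v, s3), interval = c(0, 1), tol = 1e-10)$minimum
  s3 <- optimize(function(v) reml_dev(s2, v), interval = c(0, 1), tol = 1e-10)$minimum
  if (max(abs(c(s2, s3) - previous)) < 1e-8) break
}

fit <- gls_fit(s2, s3)
round(c(sigma2_level2 = s2, sigma2_level3 = s3, estimate = fit$beta, se = fit$se), 4)
```

Searching each component on a bounded interval that starts at zero is what keeps the variance estimates non-negative without any reparameterization. Results can be compared against the `rma.mv` output from the exported script.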
Enter one effect size per line with comma-separated or tab-separated columns in the order: study label, cluster id, effect size, variance. Effect sizes that share the same cluster id are treated as statistically dependent and grouped at level 2 within their level-3 cluster. The variance is the squared standard error of each effect size; if your data has standard errors instead, tick the standard error checkbox and the tool squares them for you. To run a meta-regression, tick the moderator checkbox and add one or more moderator columns from the fifth column onward. You can also import a CSV, TSV, or Excel file by dragging it onto the upload area or pasting from a spreadsheet; the tool fuzzy-matches common column names (effect size, yi, SMD, variance, vi, standard error, study, cluster) and treats every other column as a moderator. The cluster id can be a study name, a sample identifier, a laboratory, or any grouping that induces dependence among the effects.
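For readers preparing data outside the tool, the same layout and the standard-error conversion can be sketched in base R. The study labels and standard errors below are hypothetical placeholders, and the column names are chosen here for illustration.

```r
# Four columns in the documented order: study label, cluster id, effect size,
# and a fourth column holding either variances or standard errors
raw <- "School A,District 1,0.30,0.20
School B,District 1,0.42,0.22
School C,District 2,0.10,0.24"

dat <- read.csv(text = raw, header = FALSE, strip.white = TRUE,
                col.names = c("study", "cluster", "yi", "fourth"))

fourth_column_is_se <- TRUE  # mirrors ticking the standard error checkbox

# Variance is the squared standard error
dat$vi <- if (fourth_column_is_se) dat$fourth^2 else dat$fourth

# Number of effects per cluster; effects sharing an id are treated as dependent
table(dat$cluster)
dat
```

Checking the per-cluster counts before fitting is a quick way to catch typos in cluster ids, which would otherwise silently create extra clusters.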
A standard random-effects meta-analysis has two levels: sampling error and a single between-study variance (tau-squared). It assumes every effect size is independent. The three-level model splits that single heterogeneity term into two pieces, one for variation within clusters and one for variation between clusters, and it correctly models the correlation among effects from the same cluster. The practical consequence is a more honest, usually wider, confidence interval for the pooled estimate when your data contain dependent effects. If you have only one effect size per study, the two models give identical results, which is why the three-level model is a safe generalization rather than a different method.
Yes. The tool generates a complete, ready-to-run R script using the metafor package's rma.mv function with a random structure of ~ 1 | cluster/effect, which specifies effects nested within clusters, and method = "REML". It also generates the robust() call with clubSandwich = TRUE, which uses the clubSandwich package to produce cluster-robust standard errors with the CR2 small-sample correction and Satterthwaite degrees of freedom. Copy the script with one click and paste it into RStudio. Reproducing your analysis in metafor is valuable for peer review, where reviewers often request the underlying code, and it lets you extend the model with additional random effects, correlated sampling errors, or more moderators.
Display your pooled estimate and contributing effects with our forest plot generator with REML estimation and prediction intervals. Standardize effect sizes onto a common scale first with our effect size calculator for SMD, OR, and RR. Explore moderators that may explain heterogeneity with our meta-regression data formatter for R, Stata, and CMA, and visualize the fit with our bubble plot generator. Assess small-study effects and publication bias with our funnel plot and publication bias tool. Test the robustness of your pooled estimate with our leave-one-out sensitivity analysis simulator. For reviews comparing multiple interventions, our network meta-analysis helper supports indirect comparisons and treatment ranking.
Reviewed by
Dr. Sarah Mitchell holds a PhD in Biostatistics from Johns Hopkins Bloomberg School of Public Health and has over 15 years of experience in systematic review methodology and meta-analysis. She has authored or co-authored 40+ peer-reviewed publications in journals including the Journal of Clinical Epidemiology, BMC Medical Research Methodology, and Research Synthesis Methods. A former Cochrane Review Group statistician and current editorial board member of Systematic Reviews, Dr. Mitchell has supervised 200+ evidence synthesis projects across clinical medicine, public health, and social sciences. She reviews all Research Gold tools to ensure statistical accuracy and compliance with Cochrane Handbook and PRISMA 2020 standards.