How to Do a Meta-Analysis: Step-by-Step Guide

Research Gold

To do a meta-analysis you define a precise question, systematically search and screen the relevant studies, extract a comparable effect size and its variance from each one, then combine those estimates with inverse-variance weighting into a single pooled effect and quantify how much the studies genuinely disagree. It is the statistical engine inside a systematic review, and its credibility rests as much on the early protocol and extraction work as on the model you fit at the end.

This guide is written for researchers running a real meta-analysis to a publishable standard, so it covers the mechanics most introductions skip: the variance of each effect metric, the choice of between-study variance estimator, the Hartung-Knapp adjustment, prediction intervals, dependency between effect sizes, and the diagnostics reviewers now expect. If your topic is psychology specifically, pair this with our worked overview of meta-analysis in psychology; the methodology below applies across every field.

Step 1: Define the question and pre-register the protocol

Everything starts with a focused, answerable question. Most reviewers structure it with the PICO framework, specifying the Population, Intervention, Comparison, and Outcome, or PECO for exposure questions. A sharp question dictates your eligibility criteria, your search terms, and the effect size you will eventually pool. Use our PICO framework builder to pin this down before anything else.

The protocol is where doctoral rigor begins, because it commits you to analytical choices before the data can influence them. Pre-register it, typically on PROSPERO for health topics or the Open Science Framework for others, and specify more than the question: name the effect size metric, the pooling model, the tau-squared estimator, whether you will apply the Hartung-Knapp adjustment, your planned subgroups and meta-regression covariates, and your rule for handling multiple effect sizes per study. Each of these decisions changes the result, so deciding them a priori is what separates a confirmatory synthesis from a flexible one. This single discipline does more for credibility than any computation later.

Step 2: Search multiple databases systematically

A meta-analysis is only as complete as its search, and missed studies bias the pooled estimate no matter how clean the statistics. Build a reproducible search strategy combining controlled vocabulary (MeSH in PubMed, Emtree in Embase) with free-text terms joined by Boolean logic, and run it across several databases, typically PubMed, Embase, the Cochrane Library, Web of Science, and a field-specific source such as PsycINFO or CINAHL. Our search strategy builder helps structure searches that hold up to peer review.

Extend the search beyond databases to limit publication bias at the source: search trial registries (ClinicalTrials.gov, the WHO ICTRP), grey literature and dissertations, conference proceedings, and the reference lists of included studies through backward and forward citation chasing. Record the exact strings, databases, and dates so the search is reproducible, and consider a PRESS peer review of the strategy. The search is the part of a meta-analysis most exposed in peer review, so document it as if a referee will rerun it.

Step 3: Screen studies in duplicate against your inclusion criteria

Searches return thousands of records, most irrelevant. Screening happens in two passes: titles and abstracts first, then full texts of the survivors. Best practice uses two independent reviewers at both stages, with disagreements resolved by discussion or a third reviewer. Quantify screening agreement with Cohen's kappa; a value below roughly 0.6 signals that your criteria are ambiguous and need refining before you proceed.
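As a rough illustration of the agreement check, Cohen's kappa can be computed from two reviewers' include/exclude decisions in base R. This is a minimal sketch: `cohen_kappa()` is a hypothetical helper name, and the ten decisions are invented for the example.

```r
# Cohen's kappa for two reviewers' screening decisions (base R only).
# cohen_kappa() is a hypothetical helper and the decisions are invented.
cohen_kappa <- function(r1, r2) {
  stopifnot(length(r1) == length(r2))
  lev <- union(r1, r2)                               # all decision labels used
  tab <- table(factor(r1, levels = lev), factor(r2, levels = lev))
  n   <- sum(tab)
  p_obs <- sum(diag(tab)) / n                        # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2    # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

reviewer_a <- c("include", "exclude", "exclude", "include", "exclude",
                "exclude", "include", "exclude", "exclude", "exclude")
reviewer_b <- c("include", "exclude", "include", "include", "exclude",
                "exclude", "exclude", "exclude", "exclude", "exclude")

kappa_hat <- cohen_kappa(reviewer_a, reviewer_b)
print(round(kappa_hat, 2))   # about 0.52 here, below the rough 0.6 threshold
```

In this invented example the reviewers agree on 8 of 10 records, yet kappa falls below 0.6 because much of that agreement is expected by chance when most records are excluded.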

Track every record from identification to inclusion so you can build a PRISMA 2020 flow diagram documenting how many were found, deduplicated, screened, excluded, and included, with reasons for full-text exclusions. Our PRISMA flow generator produces that figure from your screening counts. Alongside screening, appraise each included study for risk of bias using a structured tool such as Cochrane RoB 2 for randomized trials or ROBINS-I for non-randomized studies, because those judgments feed both your sensitivity analyses and the final certainty rating.

Step 4: Extract effect sizes and their variances

This is the stage where errors do the most damage, because a mistaken effect size propagates into the pooled result, and where doctoral-level work diverges from a basic summary. From each included study you extract not only an effect size but also its variance, because pooling weights every study by the inverse of that variance. Choose the metric a priori from the outcome type.

Continuous outcomes. The standardized mean difference expresses the group difference in pooled standard deviation units. Cohen's d is the mean difference divided by the pooled standard deviation, but it is upward biased in small samples, so apply the Hedges' g correction: multiply d by the factor J, approximately equal to 1 minus 3 divided by (4 times df minus 1), where df is n1 plus n2 minus 2. The variance of g is approximately (n1 plus n2)/(n1 times n2) plus g squared divided by (2 times (n1 plus n2)). When all studies use the same well-understood scale, a raw mean difference is more interpretable than a standardized one.
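The continuous-outcome formulas can be sketched in a few lines of base R. `hedges_g()` is a hypothetical helper, the summary statistics are invented, and the degrees of freedom are assumed to be n1 + n2 - 2 for two independent groups.

```r
# Hedges' g and its approximate variance from group summary statistics.
# hedges_g() is a hypothetical helper; the inputs below are invented.
hedges_g <- function(m1, sd1, n1, m2, sd2, n2) {
  df   <- n1 + n2 - 2                                        # assumed two-group design
  sd_p <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df)   # pooled SD
  d    <- (m1 - m2) / sd_p                                   # Cohen's d
  J    <- 1 - 3 / (4 * df - 1)                               # small-sample correction
  g    <- J * d                                              # Hedges' g
  v    <- (n1 + n2) / (n1 * n2) + g^2 / (2 * (n1 + n2))      # approximate variance of g
  c(g = g, v = v)
}

es <- hedges_g(m1 = 24, sd1 = 5, n1 = 20, m2 = 21, sd2 = 5, n2 = 20)
print(round(es, 4))
```

With 20 participants per group the correction shrinks d by about 2 percent; the shrinkage grows as samples get smaller.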

Binary outcomes. Use the risk ratio, odds ratio, or risk difference, and crucially analyze ratio measures on the natural-log scale, where they are roughly symmetric and normally distributed. The variance of the log odds ratio from a 2 by 2 table with cells a, b, c, d is 1/a plus 1/b plus 1/c plus 1/d. Apply a continuity correction (commonly adding 0.5) when a cell is zero, and prefer the Peto odds ratio or an exact method when events are rare, since the standard correction distorts sparse data.
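A minimal base R sketch of the log odds ratio calculation follows. `log_odds_ratio()` is a hypothetical helper and the counts are invented. The cell layout is an assumption (a and b are events and non-events in the treatment group, c and d the same in the control group), and adding 0.5 to every cell whenever any cell is zero is one common convention rather than the only one.

```r
# Log odds ratio and its variance from a 2 by 2 table.
# log_odds_ratio() is a hypothetical helper; arguments ai, bi, ci, di are the
# cells a, b, c, d, with an assumed layout of events/non-events by group.
log_odds_ratio <- function(ai, bi, ci, di, cc = 0.5) {
  cells <- c(ai, bi, ci, di)
  if (any(cells == 0)) cells <- cells + cc      # continuity correction for zero cells
  yi <- log((cells[1] * cells[4]) / (cells[2] * cells[3]))   # log(ad / bc)
  vi <- sum(1 / cells)                                       # 1/a + 1/b + 1/c + 1/d
  c(yi = yi, vi = vi)
}

es <- log_odds_ratio(12, 88, 24, 76)            # invented counts

# Build the interval on the log scale, then back-transform to the odds ratio.
ci_or <- exp(es[["yi"]] + c(-1, 1) * qnorm(0.975) * sqrt(es[["vi"]]))
print(round(c(or = exp(es[["yi"]]), lower = ci_or[1], upper = ci_or[2]), 3))

sparse <- log_odds_ratio(0, 50, 5, 45)          # zero cell triggers the correction
print(round(sparse, 3))
```

The sparse example runs, but as the paragraph above warns, a corrected estimate from a zero cell is fragile and a Peto or exact method may be the better choice.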

Correlational outcomes. Convert each correlation r with the Fisher z transformation, z equal to 0.5 times the natural log of (1 plus r) over (1 minus r), which stabilizes the variance to approximately 1/(n minus 3). Pool on the z scale and back-transform the pooled estimate to r for reporting.
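The transform, pool, and back-transform sequence might look like this in base R. The correlations and sample sizes are invented, and the pooling uses simple common-effect inverse-variance weights purely to keep the illustration short.

```r
# Fisher z workflow for correlations (invented data, base R only).
r <- c(0.30, 0.45, 0.25)     # study correlations
n <- c(50, 120, 80)          # study sample sizes

z <- atanh(r)                # identical to 0.5 * log((1 + r) / (1 - r))
v <- 1 / (n - 3)             # approximate variance on the z scale
w <- 1 / v                   # inverse-variance weights (common-effect, for illustration)

z_pooled  <- sum(w * z) / sum(w)
se_pooled <- sqrt(1 / sum(w))

# Back-transform the estimate and its interval to the r scale for reporting.
r_pooled <- tanh(z_pooled)
r_ci     <- tanh(z_pooled + c(-1, 1) * qnorm(0.975) * se_pooled)
print(round(c(r = r_pooled, lower = r_ci[1], upper = r_ci[2]), 3))
```

Note that the interval is built on the z scale and only then transformed, which is why it is not symmetric around the pooled r.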

Our guide to calculating standardized effect sizes covers the conversions between these metrics, and the effect size calculator handles individual comparisons, including from t statistics, F ratios, and odds ratios when a study does not report the raw inputs.

Step 5: Choose a pooling model and a between-study variance estimator

Pooling is inverse-variance weighting: each study receives weight w_i equal to 1 divided by its variance v_i, and the pooled effect is the sum of w_i times y_i divided by the sum of w_i. More precise studies carry more influence. The two models differ in what they assume about the true effect.

A fixed-effect (common-effect) model assumes every study estimates one identical true effect and that all observed variation is sampling error, so w_i equals 1/v_i. A random-effects model assumes the true effect varies across studies around a mean and estimates both that mean and its spread, the between-study variance tau-squared. The random-effects weight becomes w_i equal to 1 divided by (v_i plus tau-squared). Choose the model from your conceptual question about whether a single true effect plausibly exists, decided a priori, never from the result of a heterogeneity test. In most reviews, where populations, measures, and designs differ, the random-effects model is the honest default.
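To make the two weighting schemes concrete, here is a small base R sketch. `pool()` is a hypothetical helper, the effect sizes and variances are invented, and the tau-squared value of 0.05 is simply assumed so that the effect on the weights is visible.

```r
# Inverse-variance pooling under common-effect and random-effects weights.
# pool() is a hypothetical helper; yi, vi, and tau2 = 0.05 are invented.
yi <- c(0.02, 0.35, 0.62, 0.15, 0.80)        # study effect sizes
vi <- c(0.020, 0.045, 0.030, 0.015, 0.060)   # their sampling variances

pool <- function(yi, vi, tau2 = 0) {
  w  <- 1 / (vi + tau2)                      # tau2 = 0 gives common-effect weights
  mu <- sum(w * yi) / sum(w)                 # weighted mean
  c(estimate = mu, se = sqrt(1 / sum(w)), weight_pct = 100 * w / sum(w))
}

fixed  <- pool(yi, vi)                       # w_i = 1 / v_i
random <- pool(yi, vi, tau2 = 0.05)          # w_i = 1 / (v_i + tau2), tau2 assumed
print(round(rbind(fixed, random), 3))
```

Adding tau-squared to every variance pulls the weights toward equality and widens the standard error, which is how the random-effects model expresses genuine between-study variation.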

Under random effects the entire result depends on how you estimate tau-squared, and this is the choice most basic tutorials get wrong by defaulting to the historical method:

DerSimonian-Laird: the original method-of-moments estimator. It is closed-form and historically standard but underestimates tau-squared when studies are few or effects are non-normal, which understates uncertainty. No longer the preferred default.
Restricted maximum likelihood (REML): the recommended default for continuous outcomes and the metafor package default, with good properties across most conditions.
Paule-Mandel: an iterative estimator that performs well across scenarios and is often preferred for binary outcomes.
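For intuition about what these estimators do, the closed-form DerSimonian-Laird estimate and the iterative Paule-Mandel estimate can be sketched in base R. `tau2_dl()` and `tau2_pm()` are hypothetical helper names and the data are invented. REML is omitted because it needs a likelihood optimizer, so in practice you would rely on an established package for all three.

```r
# Two tau-squared estimators written out by hand (invented data, base R only).
yi <- c(0.02, 0.35, 0.62, 0.15, 0.80)
vi <- c(0.020, 0.045, 0.030, 0.015, 0.060)

# DerSimonian-Laird: method of moments based on Cochran's Q.
tau2_dl <- function(yi, vi) {
  k  <- length(yi)
  w  <- 1 / vi
  mu <- sum(w * yi) / sum(w)
  Q  <- sum(w * (yi - mu)^2)
  C  <- sum(w) - sum(w^2) / sum(w)
  max(0, (Q - (k - 1)) / C)                  # truncated at zero
}

# Paule-Mandel: find tau2 at which the generalized Q equals its expectation k - 1.
tau2_pm <- function(yi, vi) {
  k <- length(yi)
  gen_q <- function(tau2) {
    w  <- 1 / (vi + tau2)
    mu <- sum(w * yi) / sum(w)
    sum(w * (yi - mu)^2) - (k - 1)
  }
  if (gen_q(0) <= 0) return(0)               # no excess variation: estimate is zero
  # gen_q decreases in tau2 and is negative at var(yi), so the root is bracketed.
  uniroot(gen_q, lower = 0, upper = var(yi), tol = 1e-10)$root
}

dl <- tau2_dl(yi, vi)
pm <- tau2_pm(yi, vi)
print(round(c(DL = dl, PM = pm), 4))
```

The two estimates generally differ on the same data, which is why the estimator belongs in the protocol rather than being chosen after seeing the results.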

Frequently Asked Questions

How do you do a meta-analysis step by step?

Define a precise question and pre-register the protocol, search multiple databases systematically, screen studies in duplicate against inclusion criteria, extract each study's effect size and its variance, choose a pooling model and a between-study variance estimator, combine the studies with inverse-variance weighting into a forest plot, then diagnose heterogeneity with Q, I-squared, tau-squared and a prediction interval, test for publication bias, run sensitivity analyses, and rate certainty with GRADE before writing up to PRISMA 2020.

Which between-study variance estimator should I use?

Restricted maximum likelihood (REML) is the recommended default for continuous outcomes and the metafor package default. Paule-Mandel is a strong choice, particularly for binary outcomes, and performs well across conditions. The classic DerSimonian-Laird estimator underestimates the between-study variance when the number of studies is small or the effects are non-normal, so it is no longer the preferred option despite being historically standard. Whichever you pick, pair it with the Hartung-Knapp adjustment for the confidence interval.

What is the Hartung-Knapp adjustment?

The Hartung-Knapp-Sidik-Jonkman adjustment replaces the normal-distribution confidence interval for the pooled effect with a t-distribution on k minus 1 degrees of freedom and a modified variance. It gives substantially better coverage than the standard Wald interval when the number of studies is small, which is most real reviews, so it is now widely recommended as the default. Report it alongside the standard interval, since in rare cases with very few studies it can produce a counterintuitively narrow interval.
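The adjustment is simple enough to write out by hand, which may help when explaining it to reviewers. In this base R sketch `hk_ci()` is a hypothetical helper, the data are invented, and tau-squared is passed in as an assumed value of 0.06 rather than estimated.

```r
# Wald versus Hartung-Knapp intervals for a random-effects pooled estimate.
# hk_ci() is a hypothetical helper; the data and tau2 = 0.06 are invented.
yi <- c(0.02, 0.35, 0.62, 0.15, 0.80)
vi <- c(0.020, 0.045, 0.030, 0.015, 0.060)

hk_ci <- function(yi, vi, tau2, level = 0.95) {
  k  <- length(yi)
  w  <- 1 / (vi + tau2)                                      # random-effects weights
  mu <- sum(w * yi) / sum(w)
  se_wald <- sqrt(1 / sum(w))                                # standard variance
  se_hk   <- sqrt(sum(w * (yi - mu)^2) / ((k - 1) * sum(w))) # modified variance
  p  <- 1 - (1 - level) / 2
  rbind(wald = mu + c(-1, 1) * qnorm(p) * se_wald,           # normal quantile
        hk   = mu + c(-1, 1) * qt(p, df = k - 1) * se_hk)    # t quantile, k - 1 df
}

ci <- hk_ci(yi, vi, tau2 = 0.06)
colnames(ci) <- c("lower", "upper")
print(round(ci, 3))
```

With five studies the t quantile on 4 degrees of freedom is about 2.78 instead of 1.96, so in this example the Hartung-Knapp interval is noticeably wider than the Wald interval.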

How do I handle multiple effect sizes from the same study?

Several effect sizes drawn from one sample are statistically dependent, and treating them as independent understates the standard error. Options are to combine them into one composite per study, fit a three-level model with random effects at both the study and effect-size levels, or use robust variance estimation with small-sample correction, which makes few assumptions about the dependency structure. Robust variance estimation is the most general and is implemented in the metafor and robumeta packages in R.
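The simplest of these options, a composite per study, can be sketched in base R. `composite_effect()` is a hypothetical helper, the two effect sizes are invented, and the correlation of 0.6 between outcomes is an assumption you would need to justify or vary in a sensitivity analysis, since studies rarely report it.

```r
# Composite of several dependent effect sizes from one study (base R only).
# composite_effect() is a hypothetical helper; r is an assumed common
# correlation between the effect size estimates.
composite_effect <- function(yi, vi, r) {
  m       <- length(yi)
  sds     <- sqrt(vi)
  cov_mat <- r * outer(sds, sds)     # off-diagonal covariances r * sd_i * sd_j
  diag(cov_mat) <- vi                # sampling variances on the diagonal
  c(yi = mean(yi), vi = sum(cov_mat) / m^2)   # variance of the simple mean
}

dependent   <- composite_effect(c(0.40, 0.55), c(0.05, 0.05), r = 0.6)
independent <- composite_effect(c(0.40, 0.55), c(0.05, 0.05), r = 0)
print(round(rbind(dependent, independent), 3))
```

Ignoring the correlation gives the composite a variance of 0.025 instead of 0.04 in this example, which is the understated standard error the answer above describes.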

What software should I use for a meta-analysis?

R with the metafor or meta package is the free and most flexible standard and supports REML, Hartung-Knapp, meta-regression, three-level models, and robust variance estimation. RevMan is purpose-built for Cochrane reviews, while Stata and Comprehensive Meta-Analysis are common alternatives. The software only computes the pool; the credibility depends on how the effect sizes and variances were extracted.

How many studies do you need for a meta-analysis?

There is no fixed minimum, and two studies can technically be pooled, but with fewer than about five studies the between-study variance is estimated very imprecisely and the random-effects interval is unstable. Tests for publication bias such as Egger's regression require at least ten studies to have adequate power. Report the number of studies prominently, since it conditions how much weight the heterogeneity and bias diagnostics can bear.

Do you need a systematic review before a meta-analysis?

In almost all cases yes. The meta-analysis is the statistical step that pools results, while the systematic review is the transparent process of finding, screening, and appraising studies. Pooling studies you found informally, without a reproducible search, produces a biased estimate that cannot be defended in peer review.

How long does a meta-analysis take?

A rigorous meta-analysis typically takes several months, with the search, dual screening, and data extraction consuming most of the effort. The statistical pooling itself is fast once the data are clean. Timelines shorten considerably with experienced methodological support.

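Doing it in R with metafor

library(metafor)

# Example data only (an assumption, not from this guide): the BCG vaccine
# trials bundled with metafor, converted to log risk ratios with escalc().
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg)

# yi = effect sizes, vi = their sampling variances
res <- rma(yi, vi, data = dat, method = "REML", test = "knha")  # REML + Hartung-Knapp
summary(res)     # pooled estimate, CI, Q, I^2, tau^2
predict(res)     # pooled estimate with confidence and prediction intervals
forest(res)      # forest plot
regtest(res)     # Egger-type regression test for funnel asymmetry
trimfill(res)    # trim-and-fill sensitivity analysis
leave1out(res)   # leave-one-out diagnostics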

Specify the tau-squared estimator and Hartung-Knapp in the protocol

Extract the variance, not just the point estimate

Never select fixed versus random effects from the Q test

Run influence diagnostics, not just leave-one-out

Dr. James Whitfield
