Effect sizes are the foundation of every meta-analysis. They translate the raw results from individual studies, which come in different formats, on different scales, and across different populations, into a common metric that can be pooled, compared, and interpreted across an entire body of evidence. Choosing the wrong effect size can invalidate your synthesis. Choosing the right one delivers clear, actionable conclusions that advance clinical knowledge and inform practice.
This guide covers every major effect size used in contemporary meta-analysis, explains when and why to use each one, provides worked decision frameworks, walks through conversion methods between metrics, and addresses the interpretation pitfalls that catch even experienced systematic reviewers.
Why Effect Sizes Are the Currency of Meta-Analysis
Individual studies report their results in a bewildering variety of formats: means and standard deviations, percentages and p-values, regression coefficients, hazard ratios, median survival times, or simple statements like "the intervention group improved significantly." A meta-analysis cannot work with this raw heterogeneity of reporting formats. It needs a standardised measure that captures the direction, magnitude, and precision of each study's finding in a common unit that allows direct comparison and mathematical pooling.
That standardised measure is the effect size, and it simultaneously answers the two most important questions in evidence synthesis. First, in which direction does the evidence point: does the intervention help, harm, or make no difference compared to the comparator? Second, how large is the effect: is it trivially small, clinically meaningful, or transformatively large? Without effect sizes, you cannot construct a forest plot, calculate a pooled estimate, assess heterogeneity, or conduct any of the subgroup or sensitivity analyses that give a meta-analysis its analytical power. They are not a statistical convenience; they are the fundamental unit of evidence synthesis.
The choice of effect size depends on three factors: the type of outcome being measured (continuous, dichotomous, time-to-event, or correlational), the measurement scales used across studies (identical or different instruments), and the study designs contributing data (experimental, observational, case-control). Getting this choice right at the protocol stage prevents painful analytical problems downstream.
Effect Sizes for Continuous Outcomes
When your outcome is measured on a continuous scale (blood pressure, pain scores, cognitive test performance, quality-of-life ratings), you have two primary options: the mean difference (MD) and the standardised mean difference (SMD). The choice between them depends entirely on whether your included studies all use the same measurement instrument.
Mean Difference (MD)
The mean difference is the simplest and most clinically interpretable effect size: the arithmetic difference between the intervention group mean and the control group mean, expressed in the original measurement units.
When to use the mean difference:
- All included studies measure the outcome on the same scale using the same instrument (e.g., systolic blood pressure in mmHg, Hamilton Depression Rating Scale score, forced expiratory volume in litres).
- The measurement scale has a clinically meaningful interpretation that your audience understands intuitively.
- You want readers to be able to directly translate your pooled result into clinical practice without requiring statistical training to interpret standardised units.
Advantages of the mean difference: Direct clinical interpretability is the primary strength. A pooled MD of -5.2 mmHg for blood pressure is immediately meaningful to clinicians, patients, and policymakers. No information is lost through standardisation, and the result requires no additional context for interpretation.
Limitations: The mean difference cannot be used when studies measure the same underlying construct using different instruments. You cannot meaningfully average a Beck Depression Inventory score change with a PHQ-9 score change in their original units, even though both measure depression severity.
Standardised Mean Difference (SMD)
The standardised mean difference expresses the difference between group means in standard deviation units rather than the original measurement units. This standardisation allows you to pool results from studies using different measurement instruments for the same underlying construct. The two versions you will encounter are Cohen's d and Hedges' g.
Cohen's d divides the mean difference by the pooled standard deviation of both groups:
d = (Mean_intervention - Mean_control) / SD_pooled
Hedges' g applies a small-sample correction factor to Cohen's d that adjusts for the upward bias present when sample sizes are small:
g = d × J, where J ≈ 1 - (3 / (4df - 1)) and df = n_intervention + n_control - 2
For studies with more than approximately 20 participants per group, Cohen's d and Hedges' g are nearly identical. For smaller samples, Hedges' g provides a less biased estimate and is the default choice in most meta-analysis software.
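Both formulas can be computed directly from the summary statistics trials report. The sketch below is illustrative, with invented means, standard deviations, and sample sizes, and uses the approximate correction factor given above:

```python
import math

def cohens_d_and_hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d and Hedges' g from group summary statistics."""
    # Pooled standard deviation across the two groups
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sd_pooled
    # Approximate small-sample correction factor J, with df = n1 + n2 - 2
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)
    return d, d * j

# Invented summary statistics for a small trial (12 participants per group)
d, g = cohens_d_and_hedges_g(25.0, 8.0, 12, 20.0, 8.0, 12)
```

With 12 per group the correction is visible (g is about 3.4% smaller than d); with 100 per group it would be negligible, which is exactly the behaviour described above.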
Pro Tip: Always use Hedges' g instead of Cohen's d as your default SMD. The correction factor adds negligible computational complexity (your software handles it automatically) but eliminates the systematic upward bias that Cohen's d exhibits with small samples. Since systematic reviews frequently include studies with modest sample sizes, defaulting to Hedges' g is a costless safeguard that improves accuracy. Every major meta-analysis software package (RevMan, Comprehensive Meta-Analysis, R metafor) offers Hedges' g as a standard option.
Interpreting Standardised Mean Differences
Cohen's widely cited benchmarks provide a starting framework for SMD interpretation, but they should always be contextualised within your specific clinical question:
| SMD Magnitude | Cohen's Label | Clinical Context Example |
|---|---|---|
| 0.2 | Small | A new antihypertensive lowers systolic BP by 0.2 SD, modest but potentially meaningful if the drug has a favourable safety profile |
| 0.5 | Medium | A psychotherapy intervention improves depression scores by 0.5 SD, a clearly noticeable clinical improvement |
| 0.8 | Large | A surgical intervention improves functional outcomes by 0.8 SD, a substantial, easily observable change |
| 1.2+ | Very large | Rare in clinical research; common in laboratory or educational interventions |
The critical caveat: these benchmarks are generic defaults, not clinical thresholds. An SMD of 0.2 in a life-threatening condition where no effective treatment exists may be profoundly meaningful. An SMD of 0.8 in a self-reported subjective outcome with known measurement bias may be less impressive than it appears. Always interpret SMDs in the context of the baseline severity, the patient population, the clinical significance threshold, and the precision of the estimate.
Choosing Between MD and SMD
Use the mean difference whenever possible because it preserves direct clinical interpretability. Reserve the standardised mean difference for situations where pooling requires standardisation across different instruments. If most of your included studies use the same measurement scale but a small number use different instruments, consider converting those outlier results to the common scale (if conversion formulas exist) rather than standardising everything. This preserves interpretability for the majority of your evidence.
Effect Sizes for Dichotomous Outcomes
When your outcome is binary (event or no event, response or no response, alive or dead), the three primary options are the risk ratio (RR), odds ratio (OR), and risk difference (RD). Each has distinct strengths, limitations, and appropriate use cases, and confusing them is one of the most common errors in meta-analysis reporting and interpretation.
Risk Ratio (Relative Risk)
The risk ratio compares the probability of the event in the intervention group to the probability in the control group:
RR = (Events_intervention / N_intervention) / (Events_control / N_control)
When to use the risk ratio: Cohort studies and randomised controlled trials where you can directly estimate the incidence of the event in both groups. The risk ratio is the preferred measure for most intervention meta-analyses because it is the most intuitive relative measure for clinicians and patients.
Interpretation framework:
| RR Value | Meaning | Clinical Example |
|---|---|---|
| RR = 1.0 | No difference between groups | The intervention has no effect on the outcome |
| RR = 0.75 | 25% relative risk reduction | The intervention reduces the event rate by one quarter |
| RR = 1.50 | 50% relative risk increase | The intervention increases the event rate by half |
| RR = 0.50 | 50% relative risk reduction | The intervention halves the event rate |
The risk ratio is bounded by zero on the lower end but has no upper bound, and its natural value of no effect is 1.0. It is typically log-transformed for meta-analytic pooling (because the log-RR has a more symmetric sampling distribution) and then back-transformed for reporting.
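The log-transform-and-back-transform workflow looks like this in practice. The 2×2 counts are invented, and the standard error of ln(RR) is the usual large-sample approximation from the cell counts:

```python
import math

def risk_ratio_with_ci(events_i, n_i, events_c, n_c, z=1.96):
    """RR with a 95% CI computed on the log scale and back-transformed."""
    rr = (events_i / n_i) / (events_c / n_c)
    # Large-sample standard error of ln(RR) from the 2x2 cell counts
    se = math.sqrt(1 / events_i - 1 / n_i + 1 / events_c - 1 / n_c)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Invented counts: 30/200 events with the intervention vs 40/200 with control
rr, lo, hi = risk_ratio_with_ci(30, 200, 40, 200)
```

Note that the back-transformed interval is asymmetric around the RR, which is expected: symmetry holds on the log scale, not the ratio scale.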
Odds Ratio
The odds ratio compares the odds (not the probability) of the event between groups:
OR = (Events_intervention / Non-events_intervention) / (Events_control / Non-events_control)
When to use the odds ratio: Case-control studies where you cannot directly estimate incidence, logistic regression models (which naturally output odds ratios), and situations involving rare events (below approximately 10% prevalence) where OR and RR converge to nearly identical values.
The critical caveat that every meta-analyst must understand: When baseline event rates exceed approximately 10%, odds ratios systematically overestimate the corresponding risk ratio, and the divergence grows as the event rate increases. An OR of 3.0 with a 30% baseline risk corresponds to an RR of approximately 1.9, so quoting the OR as though it were the RR overstates the relative risk by roughly 60%. Never describe or interpret an odds ratio as though it were a risk ratio when the outcome is common. This single error is responsible for more misinterpretation in published meta-analyses than almost any other statistical mistake.
| Baseline Event Rate | Odds Ratio | Corresponding Risk Ratio | Overestimation |
|---|---|---|---|
| 5% (rare) | 2.0 | 1.90 | Minimal (~5%) |
| 10% | 2.0 | 1.82 | ~10% |
| 20% | 2.0 | 1.67 | ~20% |
| 30% | 2.0 | 1.54 | ~30% |
| 50% (common) | 2.0 | 1.33 | ~50% |
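These conversions follow directly from the OR-to-RR formula introduced later in the conversions section. A short sketch, taking P₀ as the control-group (baseline) risk:

```python
def or_to_rr(odds_ratio, p0):
    """Convert an odds ratio to a risk ratio given the control-group event rate p0."""
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

# How a fixed OR of 2.0 maps onto the RR as the baseline rate rises
for p0 in (0.05, 0.10, 0.20, 0.30, 0.50):
    print(f"baseline {p0:.0%}: OR 2.0 corresponds to RR {or_to_rr(2.0, p0):.2f}")
```

Running the loop shows the convergence at rare event rates and the widening gap as events become common.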
Risk Difference (Absolute Risk Reduction)
The risk difference is the absolute difference in event rates between groups:
RD = (Events_intervention / N_intervention) - (Events_control / N_control)
When to use the risk difference: When you need to communicate the absolute clinical impact of an intervention, calculate the Number Needed to Treat (NNT = 1 / |RD|), or help decision-makers understand how many events would be prevented per population treated. The risk difference contextualises relative effects within the baseline risk, which is essential for clinical decision-making.
Limitation for meta-analytic pooling: Risk differences tend to be more heterogeneous across studies than relative measures because they depend directly on the baseline event rate, which varies across populations and settings. For this reason, most meta-analyses use a relative measure (RR or OR) as the primary pooled estimate and present the risk difference or NNT as a supplementary measure for clinical interpretation.
Choosing Between RR, OR, and RD
For most meta-analyses of intervention effects, follow this decision framework:
- Primary measure: Use the risk ratio because it is intuitive, directly interpretable, and appropriate for cohort studies and RCTs.
- Case-control studies: Use the odds ratio because incidence cannot be estimated directly from case-control data.
- Supplementary measure: Report the risk difference and/or NNT alongside the primary relative measure to provide clinical context about absolute impact.
- Mixed study designs with rare events: Either RR or OR is acceptable since they converge when prevalence is below 10%.
- Mixed study designs with common events: Convert to a common metric using documented conversion formulas and present sensitivity analyses using unconverted values.
Pro Tip: Always check the direction of your effect sizes before pooling. A surprisingly common error in meta-analysis is accidentally combining effect sizes with reversed directions. If one study reports the mean difference as intervention minus control and another reports control minus intervention, your pooled estimate will be biased toward the null or even reversed. Create a consistent coding scheme during data extraction that defines the direction for all effect sizes, and verify the direction of every extracted value during data checking. This five-minute quality check guards against an error that has slipped into countless published meta-analyses.
Effect Sizes for Time-to-Event Outcomes
Hazard Ratio (HR)
The hazard ratio is the standard effect size for time-to-event (survival) data. It represents the ratio of hazard rates between two groups, where the hazard is the instantaneous rate of the event at any given time point, conditional on the participant having survived to that point.
When to use the hazard ratio: Studies reporting survival analysis or time-to-event outcomes where censoring is present (participants lost to follow-up, studies ending before all events occur). Common applications include overall survival, progression-free survival, time to relapse, time to treatment failure, and duration of response in oncology, cardiology, and infectious disease research.
Interpretation framework:
| HR Value | Meaning (for harmful events) |
|---|---|
| HR = 1.0 | No difference in event rates between groups |
| HR = 0.70 | 30% reduction in the hazard rate (favours intervention) |
| HR = 1.40 | 40% increase in the hazard rate (favours control) |
| HR = 0.50 | 50% reduction in the hazard rate (strong intervention benefit) |
The hazard ratio assumes proportional hazards, meaning that the relative difference in event rates between groups remains constant over time. When this assumption is violated (e.g., a treatment that provides early benefit but loses effectiveness over time), the HR can be misleading and alternative approaches such as restricted mean survival time (RMST) should be considered.
Extracting Hazard Ratios From Published Studies
Not all studies report hazard ratios directly, which makes extraction one of the most challenging aspects of time-to-event meta-analysis. The approach depends on what data the study provides:
| Available Data | Extraction Method | Precision |
|---|---|---|
| HR with 95% CI reported directly | Extract as-is, convert to ln(HR) and SE for pooling | Highest |
| HR with p-value only | Calculate SE from p-value using the normal distribution | High |
| Kaplan-Meier curves with numbers at risk | Use Parmar/Tierney methods to estimate HR from curve data | Moderate |
| Median survival times for both groups | Approximate HR from ratio of median times (assumes exponential distribution) | Lower |
| Only event counts and p-values | Use approximation methods from Tierney et al. (2007) | Lower |
The Tierney et al. (2007) practical methods paper and Cochrane Handbook Chapter 6 provide step-by-step extraction guidance for each scenario. Always document which extraction method you used for each study in a supplementary table so that reviewers can evaluate the potential impact of different extraction approaches on your pooled estimate.
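The first two extraction scenarios in the table reduce to a few lines of arithmetic. The sketch below recovers ln(HR) and its standard error from a reported 95% CI, and from a two-sided p-value; the HR, CI, and p-value used are invented examples:

```python
import math
from statistics import NormalDist

def log_hr_from_ci(hr, ci_low, ci_high, z=1.96):
    """ln(HR) and its SE recovered from a reported HR with 95% CI."""
    return math.log(hr), (math.log(ci_high) - math.log(ci_low)) / (2 * z)

def log_hr_from_p(hr, p_two_sided):
    """ln(HR) and its SE recovered from the HR and a two-sided p-value."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return math.log(hr), abs(math.log(hr)) / z

# Invented inputs: HR 0.70 (95% CI 0.55 to 0.89), or HR 0.70 with p = 0.004
log_hr, se_ci = log_hr_from_ci(0.70, 0.55, 0.89)
_, se_p = log_hr_from_p(0.70, 0.004)
```

Both routes yield a ln(HR) and SE pair ready for inverse-variance pooling, consistent with the workflow in the table's first two rows.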
Pro Tip: Extract data at the most granular level available. When studies report both adjusted and unadjusted hazard ratios, extract both and document which you use in your primary analysis versus sensitivity analysis. When individual patient data or detailed Kaplan-Meier curves are available alongside summary statistics, use the most granular source to maximise precision. And always record the covariates included in adjusted analyses, as pooling hazard ratios adjusted for different covariate sets introduces heterogeneity that should be explored.
Correlation Coefficients as Effect Sizes
Pearson's r and Fisher's z Transformation
The Pearson correlation coefficient measures the linear association between two continuous variables, ranging from -1 (perfect negative association) through 0 (no association) to +1 (perfect positive association). It is the standard effect size in meta-analyses that examine the strength of association between variables rather than the effect of an intervention. This measure is common in psychology, education, organisational behaviour, and social science research.
The pooling problem with raw correlations: The sampling distribution of Pearson's r is skewed, particularly when the true correlation is far from zero, and its variance depends on the correlation value itself. This means you should not average raw correlation coefficients directly. Instead, apply Fisher's z transformation before pooling:
z = 0.5 × ln((1 + r) / (1 - r))
Pool the z-transformed values using standard inverse-variance weighting, then back-transform the pooled z to obtain the pooled correlation coefficient. Fisher's z has a nearly normal sampling distribution regardless of the true correlation, which makes standard meta-analytic methods appropriate.
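A minimal fixed-effect version of this transform-pool-back-transform workflow, using the standard n − 3 inverse-variance weight for Fisher's z; the correlations and sample sizes are invented:

```python
import math

def fisher_z(r):
    """Fisher's z transform of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def pool_correlations(rs, ns):
    """Fixed-effect pooling of correlations on the Fisher-z scale.
    The inverse-variance weight for Fisher's z is n - 3."""
    weights = [n - 3 for n in ns]
    z_pooled = sum(w * fisher_z(r) for w, r in zip(weights, rs)) / sum(weights)
    return math.tanh(z_pooled)  # back-transform (tanh is the inverse of Fisher's z)

# Invented correlations and sample sizes from three hypothetical studies
r_pooled = pool_correlations([0.30, 0.40, 0.25], [53, 103, 43])
```

The pooled r sits closest to the estimate from the largest study, as the n − 3 weighting dictates.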
Benchmarks for interpreting correlations (Cohen, 1988):
| Correlation (r) | Label | Research Example |
|---|---|---|
| 0.10 | Small | Association between a single personality trait and job performance |
| 0.30 | Medium | Association between socioeconomic status and academic achievement |
| 0.50 | Large | Association between height and weight in adults |
As with all effect size benchmarks, these are starting points for interpretation, not absolute thresholds. A "small" correlation observed consistently across dozens of high-quality studies may be more scientifically meaningful than a "large" correlation from a single underpowered study with measurement problems.
Converting Between Effect Size Metrics
Sometimes your included studies report different types of effect sizes for comparable outcomes, requiring conversion to a common metric before pooling. Established conversion formulas exist for the most common scenarios, but every conversion introduces assumptions that must be documented and tested.
Common Conversion Formulas
SMD to Odds Ratio (Hasselblad-Hedges formula):
ln(OR) = d × (π / √3) ≈ d × 1.814
Odds Ratio to SMD:
d = ln(OR) × (√3 / π) ≈ ln(OR) × 0.551
Odds Ratio to Risk Ratio (requires baseline event rate P₀):
RR = OR / (1 - P₀ + P₀ × OR)
Correlation to SMD:
d = 2r / √(1 - r²)
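The formulas above translate directly into code. A small sketch, with the distributional assumptions flagged in the docstrings:

```python
import math

# Hasselblad-Hedges scaling constant: pi / sqrt(3) ≈ 1.814
SCALE = math.pi / math.sqrt(3)

def smd_to_log_or(d):
    """SMD to log odds ratio; assumes logistic distributions in both groups."""
    return d * SCALE

def log_or_to_smd(log_or):
    """Log odds ratio back to an SMD, under the same assumption."""
    return log_or / SCALE

def or_to_rr(odds_ratio, p0):
    """OR to RR; requires the control-group (baseline) event rate p0."""
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

def r_to_smd(r):
    """Correlation to SMD; assumes equal group sizes."""
    return 2 * r / math.sqrt(1 - r**2)
```

The SMD-OR pair is an exact inverse round trip, which is a useful sanity check when implementing conversions in a data-extraction pipeline.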
Important Caveats for Conversions
Every conversion formula relies on distributional assumptions that may not hold perfectly in practice. The SMD-to-OR conversion assumes logistic distributions in both groups. The OR-to-RR conversion requires accurate knowledge of the baseline event rate. The r-to-d conversion assumes equal group sizes. These assumptions introduce additional uncertainty beyond the sampling error of the original estimates.
Pro Tip: Document every conversion in a supplementary table. For every study where you converted between effect size metrics, create a row in a supplementary table showing the original reported value, the conversion formula applied, the converted value, and any assumptions required (such as the baseline event rate used for OR-to-RR conversion). This makes your analysis fully reproducible and allows reviewers to verify your calculations. Additionally, conduct a sensitivity analysis using only unconverted effect sizes to assess whether the conversions meaningfully influence your pooled estimate.
Understanding and Investigating Heterogeneity
Once you have pooled your effect sizes, you must assess heterogeneity, the extent to which the true underlying effect varies across studies beyond what would be expected from random sampling error alone. A meta-analysis that reports a pooled effect without assessing heterogeneity is fundamentally incomplete and potentially misleading.
Key Heterogeneity Statistics
| Statistic | What It Measures | Interpretation |
|---|---|---|
| Cochran's Q | Whether observed variability exceeds expected sampling error | A formal hypothesis test; significant p-value suggests true heterogeneity exists |
| I² | The percentage of total variability attributable to true heterogeneity | 0–40% might not be important; 30–60% moderate; 50–90% substantial; 75–100% considerable |
| τ² (tau-squared) | The absolute between-study variance in the true effect | Expressed in squared effect-size units; useful for comparing heterogeneity across different meta-analyses |
| Prediction interval | The range within which a future study's true effect is likely to fall | The most clinically meaningful heterogeneity measure; often much wider than the confidence interval |
Note the deliberately overlapping I² ranges in the Cochrane guidance. This reflects the fact that I² must be interpreted alongside the Q test p-value, the tau-squared estimate, and the clinical context. A meta-analysis with I² of 60% where all individual studies show benefit in the same direction is very different from one with I² of 60% where studies show conflicting directions of effect.
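For concreteness, here is how Q, I², and the DerSimonian-Laird τ² fall out of the inverse-variance weights, together with an approximate prediction interval. Higgins and colleagues recommend a t quantile with k − 2 degrees of freedom for the prediction interval; a normal quantile is used below to stay dependency-free, and the effects and standard errors are invented:

```python
import math

def heterogeneity(effects, ses):
    """Cochran's Q, I² (%), and DerSimonian-Laird tau² from study
    effect estimates and their standard errors."""
    w = [1 / se**2 for se in ses]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed)**2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    return q, i2, tau2

def random_effects_summary(effects, ses, z=1.96):
    """Random-effects pooled estimate with an approximate 95% prediction
    interval (normal quantile used; a t with k-2 df is preferable)."""
    q, i2, tau2 = heterogeneity(effects, ses)
    w_re = [1 / (se**2 + tau2) for se in ses]
    mu = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_mu = math.sqrt(1 / sum(w_re))
    half = z * math.sqrt(tau2 + se_mu**2)
    return mu, (mu - half, mu + half), q, i2, tau2

# Three invented studies with identical precision but spread-out effects
mu, pi95, q, i2, tau2 = random_effects_summary([0.2, 0.5, 0.8], [0.1, 0.1, 0.1])
```

With these invented inputs the prediction interval crosses zero even though the pooled estimate does not, illustrating exactly why the prediction interval is the more honest summary under heterogeneity.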
Pro Tip: Report prediction intervals alongside confidence intervals in every forest plot. The confidence interval for a pooled effect tells you the precision of the average estimate across the included studies. The prediction interval tells you the range within which a future study's true effect is likely to fall given the observed heterogeneity. In the presence of substantial heterogeneity, prediction intervals are often dramatically wider than confidence intervals and provide a far more honest picture of what clinicians should expect when applying the evidence to new settings and populations. Both intervals appear in the same forest plot row and cost nothing additional to compute.
Investigating the Sources of Heterogeneity
When heterogeneity is present, your job is not simply to report its magnitude but to investigate its sources. Pre-specified investigations should include:
Subgroup analyses compare effect sizes across predefined categories. Common subgroup variables include study quality (low vs. high risk of bias), population characteristics (age, severity, geographic region), intervention features (dose, duration, delivery mode), and study design (RCT vs. observational). Limit the number of subgroup analyses to avoid false-positive findings. A commonly cited guideline is no more than one subgroup variable per ten included studies.
Meta-regression models the association between continuous study-level covariates (e.g., mean age, intervention duration, publication year) and the effect size. Meta-regression requires at least ten studies per covariate examined to have adequate statistical power, and its results should be interpreted as exploratory and hypothesis-generating rather than confirmatory.
Sensitivity analyses test the robustness of your pooled estimate by systematically varying analytical decisions. Common sensitivity analyses include removing outlier studies (identified by visual inspection of the forest plot or formal statistical tests), restricting to studies at low risk of bias, using alternative effect size calculations, switching between random-effects and fixed-effect models, and excluding studies where effect sizes were converted from a different metric.
Reporting Effect Sizes: A Comprehensive Checklist
A well-reported meta-analysis presents effect sizes with sufficient context and detail for readers to independently evaluate the evidence. PRISMA 2020 mandates all of these reporting elements, and incomplete effect-size reporting is one of the most common reasons for revision requests at high-impact journals.
Essential Reporting Elements
| Element | What to Report | Why It Matters |
|---|---|---|
| Effect measure | Name and justification (e.g., "We used Hedges' g because studies used different depression instruments") | Allows readers to assess appropriateness |
| Pooled estimate with 95% CI | The primary result (e.g., SMD = 0.45, 95% CI 0.28 to 0.62) | Quantifies the best estimate and its precision |
| Prediction interval | The expected range for a new study (e.g., PI -0.12 to 1.02) | Contextualises heterogeneity clinically |
| Forest plot | Individual study effects with weights and pooled diamond | Visual display of the evidence structure |
| Heterogeneity statistics | Q, I², τ², and prediction interval | Assesses consistency across studies |
| Subgroup and sensitivity results | Pre-specified and post-hoc analyses with rationale | Tests robustness and explores variability |
| Publication bias assessment | Funnel plot and statistical test if ≥10 studies | Evaluates risk of missing studies |
| GRADE certainty rating | Certainty for each primary outcome with justification | Contextualises the strength of evidence |
Choosing the Right Effect Size: A Decision Framework
When selecting your effect size at the protocol stage, work through this decision tree to ensure you choose the most appropriate and informative measure for your specific review question.
Step 1: Identify Your Outcome Type
| Outcome Type | Go To |
|---|---|
| Continuous (measured on a scale) | Step 2 |
| Dichotomous (event / no event) | Step 3 |
| Time-to-event (survival data with censoring) | Use Hazard Ratio |
| Correlational (association between variables) | Use Fisher z-transformed Pearson r |
Step 2: Continuous Outcome: Same Scale or Different Scales?
| Scenario | Effect Size | Rationale |
|---|---|---|
| All studies use the same instrument and scale | Mean Difference (MD) | Preserves direct clinical interpretability |
| Studies use different instruments for the same construct | Standardised Mean Difference (Hedges' g) | Enables pooling across different scales |
| Most studies use the same scale, a few use different instruments | Convert outliers to the common scale if possible; otherwise use Hedges' g | Balances interpretability with inclusiveness |
Step 3: Dichotomous Outcome: Which Study Designs?
| Scenario | Primary Measure | Supplementary Measure |
|---|---|---|
| RCTs and cohort studies | Risk Ratio (RR) | Risk Difference and NNT for clinical context |
| Case-control studies | Odds Ratio (OR) | Convert to RR with baseline risk for interpretation |
| Mixed designs, rare events (<10%) | Either RR or OR (they converge) | Risk Difference |
| Mixed designs, common events (>10%) | Convert to a common metric with documented assumptions | Sensitivity analysis with unconverted values |
Common Mistakes That Undermine Meta-Analyses
Even experienced systematic reviewers make errors in effect-size selection, calculation, and interpretation. Being aware of these common pitfalls allows you to avoid them in your own work and identify them when peer-reviewing others' meta-analyses.
- Pooling different constructs under the same label. An effect size for pain intensity and one for pain frequency cannot be combined just because both are called "pain outcomes." Each pooled analysis must include only studies measuring fundamentally the same construct with conceptually equivalent endpoints.
- Ignoring the direction of effect sizes. Ensure that a positive SMD (or RR greater than 1.0) consistently means the same thing across all included studies before pooling. A single reversed-direction effect size can substantially bias the pooled estimate toward the null.
- Using Cohen's d with small samples. Switch to Hedges' g whenever any included study has fewer than 20 participants per group. The correction factor is trivial to compute and eliminates systematic upward bias.
- Treating odds ratios as risk ratios. When event rates exceed 10%, the OR overestimates the RR by an increasingly large margin. Always state clearly which measure you report and convert to the appropriate metric for interpretation.
- Averaging raw correlations without Fisher's z transformation. Raw correlations have skewed sampling distributions. Always transform to Fisher's z before pooling and back-transform the pooled estimate for reporting.
- Reporting pooled estimates without heterogeneity assessment. A pooled effect size without I², tau-squared, and ideally a prediction interval is incomplete. A tight confidence interval can mask enormous between-study variability that a prediction interval would reveal.
- Relying solely on statistical significance. An effect size can be statistically significant (p < 0.05) but clinically trivial, or non-significant but clinically meaningful given the sample size. Always report and interpret the magnitude of the effect, the width of the confidence interval, and the certainty of evidence, not just the p-value.
By understanding these fundamentals, from selecting the right metric through calculating, pooling, and interpreting effect sizes with appropriate nuance, you can produce a meta-analysis that is both statistically rigorous and clinically meaningful. The time invested in getting your effect-size decisions right at the protocol stage pays dividends throughout every subsequent phase of your systematic review.