Decades of surveys, from Oakes (1986) through Greenland et al. (2016), reveal the same misunderstandings appearing across disciplines, career stages, and even statistics textbooks. Here are the five most common and most consequential misconceptions.
Misconception 1: The p-value is the probability that the null hypothesis is true. This is the single most prevalent error and it inverts the conditional probability. The p-value is P(data | null hypothesis), not P(null hypothesis | data). To calculate the probability that the null hypothesis is true, you would need Bayesian methods with a prior distribution, something a p-value cannot provide. When a researcher writes "there is only a 3% chance the groups are truly equal," they have committed this error. The correct statement is: "there is a 3% probability of data this extreme if the groups were truly equal."
Misconception 2: A non-significant p-value means there is no effect. This is the absence-of-evidence fallacy. A p-value of 0.12 does not mean the treatment has no effect. It means the study failed to detect an effect at the 0.05 threshold. The study might have been underpowered. The effect might be real but smaller than anticipated. To assess whether a non-significant result is informative, you need to examine the confidence interval. If the 95% CI is -0.2 to 12.8, the data are consistent with both no effect and a large effect; the study is simply uninformative, not negative.
Misconception 3: A significant p-value means the effect is large or important. Statistical significance says nothing about clinical significance. With a large enough sample size, a trivially small effect, one that no clinician would consider meaningful, can produce a p-value below 0.001. A blood pressure reduction of 0.3 mmHg might be statistically significant in a trial of 50,000 patients but is clinically irrelevant. This is exactly why confidence intervals matter: they force you to confront the magnitude of the effect, not just its statistical existence.
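The arithmetic is easy to verify. A minimal sketch using assumed numbers consistent with the example (SD of systolic blood pressure of 15 mmHg, 25,000 patients per arm): the expected p-value is about 0.025 even though Cohen's d is a negligible 0.02.

```python
import numpy as np
from scipy import stats

# Hypothetical numbers matching the blood pressure example in the text
n_per_arm = 25_000          # 50,000 patients total
sd = 15.0                   # ASSUMPTION: SD of systolic BP in mmHg
diff = 0.3                  # trivially small true reduction in mmHg

se = sd * np.sqrt(2.0 / n_per_arm)    # standard error of the mean difference
z = diff / se
p = 2 * stats.norm.sf(z)              # two-sided p-value
d = diff / sd                         # Cohen's d

print(f"z = {z:.2f}, p = {p:.4f}, Cohen's d = {d:.3f}")
# z ~ 2.24, p ~ 0.025: 'significant', yet d = 0.02 is clinically negligible,
# and a still larger trial would push p below 0.001 for the same tiny effect.
```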
Misconception 4: p = 0.049 and p = 0.051 are meaningfully different. Treating 0.05 as a bright line that separates "real" findings from "null" findings has no scientific basis. Rosnow and Rosenthal (1989) famously wrote: "Surely God loves the .06 nearly as much as the .05." A p-value is a continuous measure of evidence. The difference between 0.049 and 0.051 is negligible, yet the publication system treats them as categorically distinct, one "significant," the other "not significant." This cliff-edge thinking distorts the literature, encourages p-hacking, and undermines the cumulative nature of scientific evidence.
Misconception 5: Replication guarantees the same p-value. P-values are highly variable across replications, especially for small studies. If your study yields p = 0.03, an exact replication of the same design might yield p = 0.15 or p = 0.001. Simulations show that p-values from replications of adequately powered studies still bounce around enormously (Cumming, 2008). This variability is another reason to focus on effect sizes and confidence intervals rather than fixating on whether p crossed a threshold. For guidance on how sample size affects the stability of your results, try our interactive power analysis calculator.
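Cumming's "dance of the p-values" is easy to reproduce. A minimal simulation sketch, assuming a true effect of d = 0.5 with 64 subjects per group (roughly 80% power):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, d, reps = 64, 0.5, 20    # ASSUMPTION: d = 0.5, n = 64/group gives ~80% power

pvals = []
for _ in range(reps):       # 20 exact replications of the same study
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    pvals.append(stats.ttest_ind(b, a).pvalue)

print(" ".join(f"{p:.3f}" for p in sorted(pvals)))
# Typical output spans p < 0.001 up to p > 0.05, even though every
# replication draws from an identical true effect at 80% power.
```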
Statistical significance tells you whether an observed effect would be unlikely to arise by chance alone if the null hypothesis were true. Clinical significance tells you whether that effect is large enough to matter in practice. These are fundamentally different questions, and confusing them is one of the most consequential errors in applied research.
A result can be statistically significant but clinically insignificant. If a new antihypertensive drug reduces systolic blood pressure by 0.5 mmHg compared to placebo, and the trial includes 100,000 participants, the p-value might be less than 0.001. But no physician would prescribe a drug for a 0.5 mmHg reduction. The effect is real but irrelevant.
Conversely, a result can be clinically significant but statistically non-significant. A pilot study of 20 patients might show a 15 mmHg blood pressure reduction, a clearly important effect, but produce p = 0.08 because the sample is too small to achieve conventional significance. Dismissing this finding as "negative" would be misleading. The confidence interval might be -2.0 to 32.0 mmHg, showing the data are consistent with a large and clinically meaningful effect.
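These numbers can be reconstructed from summary statistics alone. A sketch using hypothetical values consistent with the example (mean difference 15 mmHg, standard error around 8 mmHg, 18 degrees of freedom from two arms of 10 patients):

```python
from scipy import stats

# Hypothetical pilot-study summary statistics from the example above
diff, se, df = 15.0, 8.06, 18    # mean difference (mmHg), SE, t df (n = 20 total)

t = diff / se
p = 2 * stats.t.sf(abs(t), df)              # two-sided p-value
half = stats.t.ppf(0.975, df) * se          # 95% CI half-width
print(f"p = {p:.2f}, 95% CI = {diff - half:.1f} to {diff + half:.1f} mmHg")
# p ~ 0.08, CI ~ -1.9 to 31.9 mmHg: non-significant, yet fully
# consistent with a large, clinically meaningful benefit.
```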
The concept of the minimal clinically important difference (MCID) bridges statistical and clinical significance. The MCID is the smallest change in an outcome that patients or clinicians would consider meaningful. For any study, you should compare your confidence interval against the MCID, not just against zero; the code sketch after the table below illustrates the comparison.
| Scenario | Statistical Significance | Clinical Significance | Interpretation |
|---|---|---|---|
| Large effect, adequate sample | Yes (p < 0.05) | Yes (exceeds MCID) | Strong evidence for a meaningful effect |
| Large effect, small sample | No (p > 0.05) | Possibly (wide CI includes MCID) | Underpowered, need larger study |
| Small effect, huge sample | Yes (p < 0.001) | No (below MCID) | Trivial effect detected by overpowering |
| No effect, adequate sample | No (p > 0.05) | No (narrow CI around zero) | Good evidence of no meaningful effect |
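The decision logic in this table is mechanical enough to express in code. Here is a minimal sketch, with a hypothetical helper that assumes larger values mean benefit and that the MCID is expressed on the same scale as the effect estimate:

```python
def interpret(ci_low: float, ci_high: float, mcid: float) -> str:
    """Compare a 95% CI against zero and against the MCID.

    Hypothetical helper: assumes larger values mean more benefit and
    that the MCID is on the same scale as the effect estimate.
    """
    significant = ci_low > 0.0                # CI excludes zero
    if significant and ci_low >= mcid:
        return "significant and clinically meaningful"
    if significant and ci_high < mcid:
        return "significant but below the MCID (trivial effect)"
    if not significant and ci_high >= mcid:
        return "inconclusive: CI includes both no effect and the MCID"
    return "good evidence of no meaningful effect"

# The four table rows, with hypothetical numbers (MCID = 5):
for ci in [(6.0, 14.0), (0.2, 0.8), (-2.0, 32.0), (-1.0, 1.2)]:
    print(ci, "->", interpret(*ci, mcid=5.0))
```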
Field (2018) argues that effect sizes should be the primary focus of research reporting, with p-values serving only as a secondary check. This view aligns with the growing emphasis on estimation rather than hypothesis testing, a shift reflected in updated reporting guidelines, journal policies, and the ASA statement discussed below.
When evaluating evidence in a systematic review, the GRADE framework explicitly assesses imprecision by examining confidence intervals. A body of evidence is downgraded for imprecision when confidence intervals are wide enough to include both clinically meaningful benefit and clinically meaningful harm. This is why understanding the relationship between CIs and clinical significance is essential for anyone conducting or appraising evidence synthesis. See our guide on how to do a meta-analysis step by step for practical instructions on pooling effect sizes and interpreting their intervals.
How to Report P-Values and CIs (ASA 2016)
In 2016, the American Statistical Association published a landmark statement on statistical significance and p-values (Wasserstein & Lazar, 2016). It was the first time in the organization's 177-year history that it issued a formal position on a specific statistical practice. The statement was motivated by decades of evidence that p-values were being routinely misused, misinterpreted, and over-relied upon in published research.
The ASA statement contains six principles that every researcher should internalize.
Principle 1: P-values can indicate how incompatible the data are with a specified statistical model. The p-value quantifies the degree to which the observed data are inconsistent with the null hypothesis. But this is all it does, and researchers must resist reading more into it.
Principle 2: P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. This directly addresses Misconception 1 above. A p-value is not the probability that you are wrong, and it is not the probability that the null is true.
Principle 3: Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. This principle challenges the binary significant/not-significant paradigm that dominates research culture. A result with p = 0.06 should not be dismissed, and a result with p = 0.04 should not be automatically accepted as proof.
Principle 4: Proper inference requires full reporting and transparency. Cherry-picking results, testing multiple hypotheses without adjustment, and stopping data collection as soon as significance is reached, practices collectively known as p-hacking, all inflate false-positive rates. The ASA calls for pre-registration, transparent reporting of all analyses conducted, and honest acknowledgment of limitations.
Principle 5: A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. This reinforces the statistical-vs-clinical significance distinction. A tiny p-value can correspond to a trivially small effect, and a large p-value can coexist with a clinically important one.
Principle 6: By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. The ASA recommends that p-values be supplemented with effect sizes, confidence intervals, likelihood ratios, Bayes factors, or other measures that provide richer information about the data.
Building on these principles, here are practical reporting recommendations for your manuscripts.
Report exact p-values to three decimal places. Write p = 0.032, not "p < 0.05." The only exception is extremely small values, where p < 0.001 is acceptable. Never report p = 0.000; it is always an artifact of rounding.
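If you generate results tables programmatically, these rules are easy to encode. A sketch of a hypothetical formatting helper:

```python
def format_p(p: float) -> str:
    """Format a p-value per the rules above: three decimals,
    'p < 0.001' for very small values, never 'p = 0.000'."""
    if p < 0.001:
        return "p < 0.001"
    return f"p = {p:.3f}"

for p in [0.0317, 0.0004, 0.2, 0.05049]:
    print(format_p(p))
# p = 0.032, p < 0.001, p = 0.200, p = 0.050
```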
Always report confidence intervals for primary outcomes. The 95% CI should accompany every effect estimate in your Results section. For ratios (odds ratios, risk ratios, hazard ratios), report the CI on the same scale. For differences (mean differences, risk differences), include the units.
Report effect sizes alongside p-values. Use Cohen's d, Hedges' g, odds ratios, risk ratios, mean differences, or whatever metric is standard in your field. The effect size tells readers what they actually need to know, how large the effect is, while the p-value and CI provide context about certainty.
Never describe results as "trending toward significance." This phrase has no statistical meaning. If p = 0.07, report p = 0.07 and let the reader interpret the evidence in context. Phrases like "marginally significant," "approaching significance," and "a trend toward significance" obscure rather than clarify.
For non-significant results, examine the confidence interval before concluding there is no effect. If the CI is narrow and centered near zero, a null finding is informative. If the CI is wide, the study was simply underpowered and the result is inconclusive, not negative.
The test you select determines the p-value calculation and the corresponding CI construction, so choosing the right statistical test for your research is a prerequisite for everything discussed here. Selecting the wrong test invalidates both outputs regardless of how carefully you interpret them.
Meta-analysis brings p-values and confidence intervals into sharp focus because it pools estimates across multiple studies. The forest plot, the signature visualization of a meta-analysis, displays each study's effect estimate as a point with a horizontal line representing its 95% confidence interval. The diamond at the bottom represents the pooled effect estimate and its CI.
In a forest plot, several patterns are immediately visible. Studies with narrow confidence intervals (short horizontal lines) had large samples or low variability and contribute more weight to the pooled estimate. Studies with wide intervals (long horizontal lines) were smaller or more variable and contribute less weight. If a study's CI crosses the line of no effect (typically zero for mean differences or one for odds ratios), that individual study's result is not statistically significant, but it still contributes to the pooled estimate.
The pooled p-value in a meta-analysis is calculated from the overall test of the combined effect. But the pooled confidence interval is far more informative. A pooled effect of OR = 1.45 with 95% CI 1.12 to 1.88 tells you: the combined evidence suggests a 45% increase in odds, with the true value plausibly between 12% and 88% higher. The p-value (say, p = 0.005) adds only that this is unlikely under the null, information already obvious from the CI excluding 1.0.
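For intuition, a fixed-effect pooled estimate is simply an inverse-variance weighted average of the study effects on the log scale. A minimal sketch with hypothetical study data chosen to land near the example above:

```python
import numpy as np
from scipy import stats

# Hypothetical per-study log odds ratios and their standard errors
log_or = np.array([0.45, 0.25, 0.40, 0.30])
se     = np.array([0.20, 0.25, 0.15, 0.30])

w = 1.0 / se**2                              # inverse-variance weights
pooled = np.sum(w * log_or) / np.sum(w)      # pooled log OR
se_pooled = np.sqrt(1.0 / np.sum(w))

ci_lo, ci_hi = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
p = 2 * stats.norm.sf(abs(pooled / se_pooled))

print(f"pooled OR = {np.exp(pooled):.2f}, "
      f"95% CI {np.exp(ci_lo):.2f} to {np.exp(ci_hi):.2f}, p = {p:.4f}")
# The CI carries the interpretive weight; the p-value only restates
# that the interval excludes 1.0.
```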
Heterogeneity complicates interpretation. The I-squared statistic measures what proportion of the variability across studies reflects genuine differences in effect sizes rather than sampling error. High heterogeneity (I-squared above 75%) means the studies are not all estimating the same effect, and the pooled estimate may be misleading. When heterogeneity is high, prediction intervals, which show the range of effects expected in a future study, are more informative than the standard confidence interval around the pooled mean.
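These quantities are all computable from per-study effects and standard errors. A sketch using the standard DerSimonian-Laird estimator for the between-study variance and the usual t-based prediction interval, on hypothetical data:

```python
import numpy as np
from scipy import stats

# Hypothetical per-study effects (log ORs) and standard errors
y  = np.array([0.10, 0.60, 0.35, 0.80, -0.05])
se = np.array([0.20, 0.22, 0.18, 0.25, 0.21])

w = 1.0 / se**2
fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - fixed) ** 2)             # Cochran's Q
k = len(y)
I2 = max(0.0, (Q - (k - 1)) / Q) * 100       # I-squared, as a percentage

# DerSimonian-Laird estimate of between-study variance tau^2
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / C)

# Random-effects pooled estimate
w_re = 1.0 / (se**2 + tau2)
mu = np.sum(w_re * y) / np.sum(w_re)
se_mu = np.sqrt(1.0 / np.sum(w_re))

# 95% prediction interval for the effect in a future study
half = stats.t.ppf(0.975, k - 2) * np.sqrt(tau2 + se_mu**2)
print(f"I^2 = {I2:.0f}%, tau^2 = {tau2:.3f}")
print(f"pooled = {mu:.2f} (95% CI {mu - 1.96*se_mu:.2f} to {mu + 1.96*se_mu:.2f})")
print(f"95% prediction interval: {mu - half:.2f} to {mu + half:.2f}")
# With these numbers the prediction interval spans no effect even
# though the pooled CI excludes it.
```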
The GRADE framework uses confidence intervals explicitly to assess the certainty of evidence. One of GRADE's five domains for downgrading certainty is imprecision, defined by examining whether the confidence interval crosses thresholds of clinical importance. If the 95% CI of the pooled effect includes both appreciable benefit and appreciable harm, or both a meaningful effect and no effect, the evidence is downgraded for imprecision regardless of the p-value.
For example, consider a pooled risk ratio of 0.75 with 95% CI 0.50 to 1.12 and p = 0.16. The p-value says "not significant." But the CI tells a more nuanced story: the data are consistent with a 50% risk reduction (substantial benefit) and also with a 12% risk increase (potential harm). This is an imprecise estimate that warrants a larger trial, not a conclusion that the intervention does not work. GRADE would rate this evidence as low certainty due to imprecision, even though individual study quality might be high.
Subgroup analyses and sensitivity analyses in meta-analysis rely heavily on comparing confidence intervals. When the CIs of two subgroups overlap substantially, the subgroup difference is unlikely to be real, even if one subgroup's result is "significant" and the other is not. A formal interaction test (comparing the difference between subgroups against zero) is required to claim a genuine subgroup effect.
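The interaction test reduces to comparing the difference between subgroup estimates against its standard error. A sketch with hypothetical subgroup results:

```python
import numpy as np
from scipy import stats

# Hypothetical subgroup estimates (log ORs) and standard errors
b1, se1 = 0.50, 0.20      # subgroup A: "significant" (z = 2.5, p = 0.012)
b2, se2 = 0.25, 0.22      # subgroup B: "not significant" (z ~ 1.1, p = 0.26)

# Formal interaction test: is the difference between subgroups nonzero?
diff = b1 - b2
se_diff = np.sqrt(se1**2 + se2**2)
z = diff / se_diff
p_interaction = 2 * stats.norm.sf(abs(z))
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p_interaction:.2f}")
# p ~ 0.40: no evidence the subgroups truly differ, despite one
# being 'significant' and the other not.
```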
Our guide on biostatistics consulting and when to hire one covers scenarios where the interplay between p-values, confidence intervals, heterogeneity statistics, and GRADE assessments exceeds what most research teams can handle independently. Having a statistician involved from protocol development through manuscript preparation reduces the risk of misinterpretation at every stage.
Understanding p-values and confidence intervals is not optional for researchers who publish quantitative work. These concepts underpin every hypothesis test, every effect estimate, every forest plot, and every evidence summary you will encounter or produce. The shift away from binary significance testing, driven by the ASA statement, updated reporting guidelines, and growing awareness of reproducibility failures, demands that researchers move beyond "p < 0.05 therefore true" and embrace a more nuanced, estimation-centered approach to statistical inference. Report exact p-values, always include confidence intervals, interpret effect sizes against clinical thresholds, and resist the temptation to reduce complex evidence to a single significant-or-not label. Your manuscripts, your reviews, and your clinical decisions will be stronger for it.
If interpreting p-values and confidence intervals feels overwhelming, it may be time to get professional help. Learn when to hire a biostatistician and what it costs.
For categorical data, p-values come from different tests depending on sample size. Learn when to use chi-square versus Fisher's exact test and how each handles small expected cell counts.
Reviewers often challenge your statistical choices. Our guide on responding to statistical reviewer comments provides templates for defending your p-value thresholds and confidence interval interpretations.