Choosing the right statistical test is one of the most consequential decisions in any research project, a point stressed by the Cochrane Handbook and leading biostatistics textbooks. The test you select evaluates group differences, measures the strength of association, or models predictions, and it directly determines whether your p-values, confidence intervals, and conclusions are valid. Select the wrong test and you risk producing results that mislead readers, fail peer review, or, worse, inform clinical practice based on flawed inference.
Yet the decision is not as complicated as it seems. Every statistical test selection follows the same logic: identify what you are trying to measure, characterize your data, count your groups, and verify your assumptions. This guide provides a structured statistical test decision tree that walks you through those four questions and maps you to the correct test every time.
Whether you are comparing two treatment arms, examining correlations between variables, or analyzing categorical survey responses, this framework covers the tests you will encounter most frequently in health, social science, and biomedical research. For a deeper understanding of interpreting results once you have chosen your test, see our p-value interpretation guide.
Why Choosing the Right Statistical Test Matters
Statistical tests are not interchangeable. Each test is built on specific mathematical assumptions about your data: its distribution, its measurement scale, its independence structure. When those assumptions hold, the test produces accurate probability estimates. When they do not, the test produces misleading p-values that can inflate or deflate your apparent findings.
The consequences are real. A researcher who runs multiple independent t-tests instead of a one-way ANOVA inflates the Type I error rate: the probability of declaring a significant result when none exists. With five pairwise comparisons at alpha = 0.05, the cumulative false positive rate climbs to roughly 23%. In clinical research, that kind of error can lead to adoption of ineffective treatments or abandonment of promising ones.
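The arithmetic behind that roughly-23% figure is 1 − (1 − α)^k for k independent comparisons at level α. A quick sketch (the function name is illustrative):

```python
# Family-wise error rate for k independent comparisons at level alpha:
# P(at least one false positive) = 1 - (1 - alpha)^k
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

# Five pairwise comparisons at alpha = 0.05
print(round(familywise_error_rate(5), 3))  # → 0.226
```
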
This multiple-t-test error is the most common one we see, and researchers usually inflate their false positive rate without realizing it. A single ANOVA with post-hoc corrections handles the same comparisons while maintaining the nominal error rate. Field (2018) emphasizes that test selection is not a matter of preference but of mathematical necessity: each test answers a specific type of question under specific data conditions.
Altman (1991) demonstrated that a substantial proportion of published biomedical research contains statistical errors traceable to incorrect test selection. These errors survive peer review because reviewers focus on clinical content and may not scrutinize analytical choices closely. Understanding which statistical test to use protects your work from this category of preventable error and strengthens the credibility of your findings at every stage, from internal review through journal publication.
Getting the test right also matters for reproducibility. When another researcher attempts to replicate your study, they need to apply the same analytical framework. If your original test choice was inappropriate, replication attempts will produce different results even with identical data, undermining confidence in the original findings. Statistical test selection is a core component of methodological rigor.
The 4 Questions Before Choosing a Statistical Test
Before consulting any decision tree or flowchart, answer four questions about your study. These four questions narrow the entire universe of statistical tests down to one or two candidates.
Question 1: What Is Your Research Question?
Research questions fall into three broad categories, and each category points to a different family of tests.
| Research Question Type | What You Are Asking | Test Family |
|---|---|---|
| Comparison | Are groups different from each other? | t-test, ANOVA, Mann-Whitney, Kruskal-Wallis |
| Relationship | Are two variables associated? | Pearson, Spearman, chi-square |
| Prediction | Can one variable predict another? | Linear regression, logistic regression |
A comparison question asks whether an outcome differs between groups, for example, "Is blood pressure lower in the treatment group than the control group?" A relationship question asks whether two variables move together, "Is there an association between exercise frequency and cholesterol level?" A prediction question asks whether one variable can forecast another, "Does BMI predict diabetes risk after controlling for age and sex?"
Start here. The research question determines the analysis. If you are unclear on your question type, revisit your hypothesis statement before proceeding.
Question 2: What Type of Data Do You Have?
The measurement scale of your outcome variable is the single most important data characteristic for test selection.
Continuous data (interval or ratio scale) includes measurements like blood pressure in mmHg, reaction time in milliseconds, or income in dollars. These variables have meaningful numerical distances between values and support arithmetic operations.
Ordinal data consists of ranked categories where the order matters but the distances between ranks are not necessarily equal: pain scales (mild, moderate, severe), Likert responses (strongly agree to strongly disagree), or cancer staging (I, II, III, IV).
Categorical data (nominal) includes unordered categories like treatment group (drug A vs. drug B vs. placebo), disease status (present vs. absent), or blood type (A, B, AB, O). Categorical outcomes require fundamentally different tests than continuous outcomes.
Question 3: How Many Groups Are You Comparing?
If your research question involves comparison, the number of groups determines whether you use a two-sample test or a multi-sample test.
Two groups: use a t-test (parametric) or Mann-Whitney U (non-parametric). If the same subjects are measured twice (before and after), use a paired test.
Three or more groups: use ANOVA (parametric) or Kruskal-Wallis (non-parametric). Never run multiple pairwise t-tests; this inflates your false positive rate, as discussed above.
Question 4: Do Your Data Meet the Test's Assumptions?
Every parametric test assumes that your data follow a normal distribution, that variances are approximately equal across groups, and that observations are independent. If any assumption is violated, you can transform the data, use a robust variant, or switch to a non-parametric alternative.
Non-parametric tests relax distributional assumptions. They work on ranks rather than raw values, making them valid regardless of whether your data are normally distributed. The tradeoff is slightly reduced statistical power when the data actually are normal, but that reduction is typically small (around 5% for the Mann-Whitney relative to the t-test under normality).
Answering these four questions takes you from hundreds of possible tests to a single correct choice. The sections below walk through each test family in detail.
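As a sketch, the answers to Questions 3 and 4 can be wired into a small lookup. This is a simplified illustration covering only the comparison tests discussed in this guide, not a complete decision tree, and the function name is invented:

```python
def choose_comparison_test(n_groups: int, paired: bool, parametric_ok: bool) -> str:
    """Map the answers to Questions 3 and 4 onto a comparison test.

    parametric_ok: True when the outcome is continuous and the
    normality / equal-variance assumptions are plausible.
    (Simplified sketch -- real selection also weighs Questions 1 and 2.)
    """
    if n_groups == 2:
        if paired:
            return "paired t-test" if parametric_ok else "Wilcoxon signed-rank"
        return "independent t-test" if parametric_ok else "Mann-Whitney U"
    if paired:  # three or more related measurements on the same subjects
        return "repeated measures ANOVA" if parametric_ok else "Friedman test"
    return "one-way ANOVA" if parametric_ok else "Kruskal-Wallis"

print(choose_comparison_test(2, paired=False, parametric_ok=False))  # → Mann-Whitney U
```
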
Statistical Tests for Comparing Groups
Comparison tests evaluate whether observed differences between groups are likely to reflect real effects or are consistent with chance variation, by comparing the observed effect size against what would be expected under the null hypothesis. Your choice depends on the number of groups, the data type, and whether the test's assumptions hold.
Independent Samples T-test
The t-test compares the means of two independent groups. It is the workhorse of two-group comparisons in experimental and clinical research: treatment versus control, male versus female, intervention versus standard care.
Assumptions: Continuous outcome variable, normally distributed data in each group (or n > 30 per group by the central limit theorem), approximately equal variances (testable with Levene's test), and independent observations.
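Assuming SciPy is available, these checks map onto two library calls: `scipy.stats.shapiro` for per-group normality and `scipy.stats.levene` for equality of variances. A minimal sketch with invented blood pressure values:

```python
from scipy import stats

drug    = [120.1, 118.4, 121.3, 119.8, 117.9, 122.0, 120.5, 118.8]  # illustrative mmHg
placebo = [129.7, 131.2, 128.9, 130.4, 132.1, 129.5, 130.8, 131.6]

# Normality within each group (Shapiro-Wilk); p > 0.05 -> no evidence against normality
for name, group in [("drug", drug), ("placebo", placebo)]:
    stat, p = stats.shapiro(group)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Homogeneity of variance across groups (Levene's test)
stat, p = stats.levene(drug, placebo)
print(f"Levene's test p = {p:.3f}")
```
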
When to use: You have two independent groups and a continuous outcome that is approximately normally distributed. Examples include comparing mean blood pressure between drug and placebo groups, or comparing test scores between two teaching methods.
Variants: The independent samples t-test compares two separate groups. The one-sample t-test compares a single group's mean to a known value. Welch's t-test is a modification that does not assume equal variances and is increasingly recommended as the default.
Effect size: Report Cohen's d alongside the p-value. A significant p-value tells you the difference is unlikely due to chance; Cohen's d tells you whether the difference is large enough to matter practically.
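Putting the pieces together: SciPy's `ttest_ind` runs the test itself (with `equal_var=False` for the Welch variant), and Cohen's d is computed by hand from a pooled standard deviation. The data are invented for illustration:

```python
from statistics import mean, stdev
from scipy import stats

drug    = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9]  # illustrative outcome values
placebo = [7.0, 6.8, 7.1, 6.9, 7.2, 7.0, 6.9, 7.1]

# Welch's t-test (does not assume equal variances)
t, p = stats.ttest_ind(drug, placebo, equal_var=False)

# Cohen's d: mean difference divided by the pooled standard deviation
n1, n2 = len(drug), len(placebo)
pooled_sd = (((n1 - 1) * stdev(drug) ** 2 + (n2 - 1) * stdev(placebo) ** 2)
             / (n1 + n2 - 2)) ** 0.5
d = (mean(placebo) - mean(drug)) / pooled_sd

print(f"t = {t:.2f}, p = {p:.2g}, Cohen's d = {d:.2f}")
```
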
Mann-Whitney U Test
The Mann-Whitney U test is the non-parametric alternative to the independent samples t-test. It compares the distributions of two independent groups by ranking all observations and testing whether ranks are distributed evenly between groups.
When to use: Your outcome is ordinal (e.g., pain severity on a 1-10 scale), your continuous data violate normality with small sample sizes, or you have significant outliers that would distort the t-test. The Mann-Whitney is also appropriate when you cannot verify the normality assumption due to very small samples (n < 15 per group).
Interpretation: A significant Mann-Whitney result indicates that one group tends to have higher values than the other; it tests stochastic dominance rather than mean difference. Report the U statistic, the p-value, and the rank-biserial correlation as an effect size measure.
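A sketch with SciPy's `mannwhitneyu`, using invented pain scores. SciPy does not return the rank-biserial correlation directly, so it is derived from U by hand (sign conventions for this formula vary across texts; here negative r means the first group tends lower):

```python
from scipy import stats

# Ordinal pain scores (1-10) for two independent groups -- illustrative data
treatment = [2, 3, 2, 4, 3, 2, 3]
control   = [6, 7, 5, 8, 6, 7, 6]

u, p = stats.mannwhitneyu(treatment, control, alternative="two-sided")

# Rank-biserial correlation: r = 2U/(n1*n2) - 1, ranging from -1 to 1;
# r < 0 here means the treatment group tends to have lower scores
n1, n2 = len(treatment), len(control)
r = 2 * u / (n1 * n2) - 1

print(f"U = {u}, p = {p:.4f}, rank-biserial r = {r:.2f}")
```
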
One-Way ANOVA
ANOVA (Analysis of Variance) extends the two-group comparison to three or more groups. It tests whether at least one group mean differs significantly from the others. ANOVA compares multiple group means simultaneously while controlling the family-wise error rate, something that multiple t-tests cannot do.
Assumptions: Continuous outcome, normally distributed residuals, homogeneity of variance across groups (Levene's test), and independence. ANOVA is robust to moderate violations of normality when group sizes are equal and reasonably large (n > 20 per group).
Post-hoc tests: A significant ANOVA tells you that at least one group differs but does not identify which groups. Follow up with post-hoc pairwise comparisons: Tukey's HSD (controls for all pairwise comparisons), Bonferroni (conservative), or Games-Howell (when variances are unequal).
Variants: One-way ANOVA compares groups on a single factor. Two-way ANOVA examines two factors simultaneously and their interaction. Repeated measures ANOVA handles within-subjects designs where the same participants are measured multiple times. ANCOVA adds continuous covariates to control for confounding variables.
To determine required sample size for your ANOVA before data collection, run a power analysis specifying the number of groups, expected effect size, and desired power level.
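The omnibus-plus-post-hoc workflow can be sketched with SciPy: `f_oneway` for the omnibus test and `tukey_hsd` (available in recent SciPy versions, 1.8+) for the pairwise follow-up. The dosage data are invented:

```python
from scipy import stats

# Illustrative outcome values for three dosage groups
low    = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
medium = [13.4, 13.1, 13.6, 13.2, 13.5, 13.3]
high   = [15.0, 14.8, 15.2, 14.9, 15.1, 14.7]

# Omnibus test: does at least one group mean differ?
f, p = stats.f_oneway(low, medium, high)
print(f"ANOVA: F = {f:.1f}, p = {p:.2g}")

# Post-hoc: which specific pairs differ? (requires SciPy >= 1.8)
result = stats.tukey_hsd(low, medium, high)
print(result)
```
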
Kruskal-Wallis Test
The Kruskal-Wallis test is the non-parametric equivalent of one-way ANOVA. It compares the distributions of three or more independent groups using ranks.
When to use: Your outcome is ordinal, your continuous data are non-normal across groups, or sample sizes are too small to rely on ANOVA's robustness to non-normality. Common applications include comparing satisfaction ratings across three treatment protocols or comparing pain levels across four dosage groups.
Follow-up: A significant Kruskal-Wallis test indicates that at least one group differs. Use Dunn's test with Bonferroni correction for pairwise comparisons to identify which specific groups differ.
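With SciPy, the omnibus step is `scipy.stats.kruskal`; Dunn's test itself is not in SciPy (the scikit-posthocs package provides it), so only the omnibus sketch is shown here, with invented satisfaction ratings:

```python
from scipy import stats

# Ordinal satisfaction ratings (1-5) across three protocols -- illustrative data
protocol_a = [2, 1, 2, 3, 2, 1, 2]
protocol_b = [3, 3, 4, 3, 2, 4, 3]
protocol_c = [5, 4, 5, 4, 5, 4, 5]

# Rank-based omnibus test: does at least one group's distribution differ?
h, p = stats.kruskal(protocol_a, protocol_b, protocol_c)
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")
```
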
Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test is the non-parametric alternative to the paired t-test. It compares two related measurements, typically before and after an intervention on the same subjects, without assuming normality.
When to use: You have paired or matched data (the same subjects measured at two time points, or matched case-control pairs) and the difference scores are not normally distributed, or you have an ordinal outcome measured at two time points.
Interpretation: The Wilcoxon test evaluates whether the median difference between paired observations is significantly different from zero. Report the W statistic (or T statistic, depending on software), the p-value, and the matched-pairs rank-biserial correlation as the effect size.
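A sketch with SciPy's `wilcoxon`, using invented before/after pain scores for the same subjects (SciPy reports the smaller of the two signed-rank sums as the statistic):

```python
from scipy import stats

# Pain scores for the same 8 subjects before and after an intervention -- illustrative
before = [7.2, 6.8, 7.5, 8.0, 6.9, 7.8, 7.1, 7.6]
after  = [5.1, 5.9, 6.2, 6.0, 5.5, 6.3, 5.9, 6.6]

# Tests whether the median paired difference is zero
w, p = stats.wilcoxon(before, after)
print(f"Wilcoxon signed-rank: W = {w}, p = {p:.4f}")
```
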
The table below summarizes all comparison tests and their selection criteria.
| Scenario | Parametric Test | Non-Parametric Alternative | Key Assumption Check |
|---|---|---|---|
| 2 independent groups, continuous | Independent t-test | Mann-Whitney U | Normality + equal variance |
| 2 related groups, continuous | Paired t-test | Wilcoxon signed-rank | Normality of differences |
| 3+ independent groups, continuous | One-way ANOVA | Kruskal-Wallis | Normality + homogeneity |
| 3+ related groups, continuous | Repeated measures ANOVA | Friedman test | Sphericity (Mauchly's test) |
Not sure which statistical test fits your research design? Our biostatisticians help researchers select and run the right analyses, from simple comparisons to complex mixed models. Claim your free research assessment, or explore our biostatistics consulting services.