Choosing the right statistical test is one of the most consequential decisions in any research project. The statistical test you select evaluates group differences, measures association strength, or models predictions, and it directly determines whether your p-values, confidence intervals, and conclusions are valid. Select the wrong test and you risk producing results that mislead readers, fail peer review, or, worse, inform clinical practice based on flawed inference.
Yet the decision is not as complicated as it seems. Every statistical test selection follows the same logic: identify what you are trying to measure, characterize your data, count your groups, and verify your assumptions. This guide provides a structured statistical test decision tree that walks you through those four questions and maps you to the correct test every time.
Whether you are comparing two treatment arms, examining correlations between variables, or analyzing categorical survey responses, this framework covers the tests you will encounter most frequently in health, social science, and biomedical research. For a deeper understanding of interpreting results once you have chosen your test, see our p-value interpretation guide.
Why Choosing the Right Statistical Test Matters
Statistical tests are not interchangeable. Each test is built on specific mathematical assumptions about your data: its distribution, its measurement scale, and its independence structure. When those assumptions hold, the test produces accurate probability estimates. When they do not, the test produces misleading p-values that can inflate or deflate your apparent findings.
The consequences are real. A researcher who runs multiple independent t-tests instead of a one-way ANOVA inflates the Type I Error rate, the probability of declaring a significant result when none exists. With five pairwise comparisons at alpha = 0.05, the cumulative false positive rate climbs to roughly 23%. In clinical research, that kind of error can lead to adoption of ineffective treatments or abandonment of promising ones.
The most common error we see is researchers using multiple t-tests instead of ANOVA, inflating false positive rates without realizing it. A single ANOVA with post-hoc corrections handles the same comparison while maintaining the nominal error rate. Field (2018) emphasizes that test selection is not a matter of preference but of mathematical necessity: each test answers a specific type of question under specific data conditions.
Altman (1991) demonstrated that a substantial proportion of published biomedical research contains statistical errors traceable to incorrect test selection. These errors survive peer review because reviewers focus on clinical content and may not scrutinize analytical choices closely. Understanding which statistical test to use protects your work from this category of preventable error and strengthens the credibility of your findings at every stage, from internal review through journal publication.
Getting the test right also matters for reproducibility. When another researcher attempts to replicate your study, they need to apply the same analytical framework. If your original test choice was inappropriate, replication attempts will produce different results even with identical data, undermining confidence in the original findings. Statistical test selection is a core component of methodological rigor.
The 4 Questions Before Choosing a Statistical Test
Before consulting any decision tree or flowchart, answer four questions about your study. These four questions narrow the entire universe of statistical tests down to one or two candidates.
Question 1: What Is Your Research Question?
Research questions fall into three broad categories, and each category points to a different family of tests.
| Research Question Type | What You Are Asking | Test Family |
|---|---|---|
| Comparison | Are groups different from each other? | t-test, ANOVA, Mann-Whitney, Kruskal-Wallis |
| Relationship | Are two variables associated? | Pearson, Spearman, chi-square |
| Prediction | Can one variable predict another? | Linear regression, logistic regression |
A comparison question asks whether an outcome differs between groups, for example, "Is blood pressure lower in the treatment group than the control group?" A relationship question asks whether two variables move together: "Is there an association between exercise frequency and cholesterol level?" A prediction question asks whether one variable can forecast another: "Does BMI predict diabetes risk after controlling for age and sex?"
Start here. The research question determines the analysis. If you are unclear on your question type, revisit your hypothesis statement before proceeding.
Question 2: What Type of Data Do You Have?
The measurement scale of your outcome variable is the single most important data characteristic for test selection.
Continuous data (interval or ratio scale) includes measurements like blood pressure in mmHg, reaction time in milliseconds, or income in dollars. These variables have meaningful numerical distances between values and support arithmetic operations.
Ordinal data consists of ranked categories where the order matters but the distances between ranks are not necessarily equal: pain scales (mild, moderate, severe), Likert responses (strongly agree to strongly disagree), or cancer staging (I, II, III, IV).
Categorical data (nominal) includes unordered categories like treatment group (drug A vs. drug B vs. placebo), disease status (present vs. absent), or blood type (A, B, AB, O). Categorical outcomes require fundamentally different tests than continuous outcomes.
Question 3: How Many Groups Are You Comparing?
If your research question involves comparison, the number of groups determines whether you use a two-sample test or a multi-sample test.
Two groups: use a t-test (parametric) or Mann-Whitney U (non-parametric). If the same subjects are measured twice (before and after), use a paired test.
Three or more groups: use ANOVA (parametric) or Kruskal-Wallis (non-parametric). Never run multiple pairwise t-tests; this inflates your false positive rate as discussed above.
Question 4: Do Your Data Meet the Test's Assumptions?
Every parametric test assumes that your data follow a Normal Distribution, that variances are approximately equal across groups, and that observations are independent. If any assumption is violated, you either transform the data, use a robust variant, or switch to a non-parametric alternative.
Non-parametric tests relax distributional assumptions. They work on ranks rather than raw values, making them valid regardless of whether your data are normally distributed. The tradeoff is slightly reduced statistical power when the data actually are normal, but that reduction is typically small (around 5% for the Mann-Whitney relative to the t-test under normality).
Answering these four questions takes you from hundreds of possible tests to a single correct choice. The sections below walk through each test family in detail.
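The four-question framework above can be sketched as a simple lookup. The function below is an illustrative sketch of this guide's decision logic, not an exhaustive selector; the function name and arguments are our own invention, and the guide itself does not prescribe any particular software.

```python
def choose_test(question, outcome, n_groups=2, paired=False, parametric=True):
    """Map the four framework answers to a candidate test name.

    question: "comparison", "relationship", or "prediction"
    outcome:  "continuous", "ordinal", or "categorical"
    """
    if question == "relationship":
        return "Pearson correlation" if parametric else "Spearman correlation"
    if question == "prediction":
        return "Linear regression" if outcome == "continuous" else "Logistic regression"
    # comparison questions
    if outcome == "categorical":
        return "McNemar's test" if paired else "Chi-square / Fisher's exact"
    if n_groups == 2:
        if paired:
            return "Paired t-test" if parametric else "Wilcoxon signed-rank"
        return "Independent t-test" if parametric else "Mann-Whitney U"
    if paired:
        return "Repeated measures ANOVA" if parametric else "Friedman test"
    return "One-way ANOVA" if parametric else "Kruskal-Wallis"

print(choose_test("comparison", "continuous", n_groups=3))  # One-way ANOVA
```

The point is not the code itself but the shape of the decision: each answer prunes the space of candidate tests until only one or two remain.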
Statistical Tests for Comparing Groups
Comparison tests evaluate whether observed differences between groups are likely to reflect real effects or are consistent with chance variation. Each test works by weighing the observed effect against the variation expected under the null hypothesis. Your choice depends on the number of groups, data type, and assumption status.
Independent Samples T-test
The t-test compares the means of two independent groups. It is the workhorse of two-group comparisons in experimental and clinical research: treatment versus control, male versus female, intervention versus standard care.
Assumptions: Continuous outcome variable, normally distributed data in each group (or n > 30 per group by central limit theorem), approximately equal variances (testable with Levene's Test), and independent observations.
When to use: You have two independent groups and a continuous outcome that is approximately normally distributed. Examples include comparing mean blood pressure between drug and placebo groups, or comparing test scores between two teaching methods.
Variants: The independent samples t-test compares two separate groups. The one-sample t-test compares a single group's mean to a known value. Welch's t-test is a modification that does not assume equal variances and is increasingly recommended as the default.
Effect size: Report Cohen's d alongside the p-value. A significant p-value tells you the difference is unlikely due to chance; Cohen's d tells you whether the difference is large enough to matter practically.
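As a concrete illustration, here is a minimal sketch of a Welch's t-test with Cohen's d on simulated blood pressure data, assuming Python with `numpy` and `scipy` (the guide itself does not prescribe any software):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
drug = rng.normal(120, 12, 50)      # simulated systolic BP, treatment arm
placebo = rng.normal(130, 12, 50)   # simulated systolic BP, control arm

# Welch's t-test (equal_var=False) does not assume equal variances
# and is a sensible default for two independent groups
t, p = stats.ttest_ind(drug, placebo, equal_var=False)

# Cohen's d using the pooled standard deviation
pooled_sd = np.sqrt((drug.var(ddof=1) + placebo.var(ddof=1)) / 2)
d = (drug.mean() - placebo.mean()) / pooled_sd

print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
```

Reporting both `p` and `d` follows the recommendation above: the p-value speaks to chance, the effect size to practical importance.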
Mann-Whitney U Test
The Mann-Whitney U test is the non-parametric alternative to the independent samples t-test. It compares the distributions of two independent groups by ranking all observations and testing whether ranks are distributed evenly between groups.
When to use: Your outcome is ordinal (e.g., pain severity on a 1-10 scale), your continuous data violate normality with small sample sizes, or you have significant outliers that would distort the t-test. The Mann-Whitney is also appropriate when you cannot verify the normality assumption due to very small samples (n < 15 per group).
Interpretation: A significant Mann-Whitney result indicates that one group tends to have higher values than the other; it tests stochastic dominance rather than a difference in means. Report the U statistic, the p-value, and the rank-biserial correlation as an effect size measure.
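A minimal sketch of the Mann-Whitney U test on simulated ordinal pain scores, again assuming `scipy` (the rank-biserial formula r = 1 - 2U/(n1*n2) is computed by hand because scipy does not report it):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# simulated ordinal pain scores in two independent groups
standard = rng.integers(4, 10, 25)      # tends toward higher pain
new_protocol = rng.integers(2, 7, 25)   # tends toward lower pain

u, p = stats.mannwhitneyu(new_protocol, standard, alternative="two-sided")

# rank-biserial correlation as an effect size: r = 1 - 2U / (n1 * n2)
r_rb = 1 - 2 * u / (len(new_protocol) * len(standard))
print(f"U = {u}, p = {p:.4g}, rank-biserial r = {r_rb:.2f}")
```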
One-Way ANOVA
ANOVA (Analysis of Variance) extends the two-group comparison to three or more groups. It tests whether at least one group mean differs significantly from the others. ANOVA compares multiple group means simultaneously while controlling the family-wise error rate, something that multiple t-tests cannot do.
Assumptions: Continuous outcome, normally distributed residuals, homogeneity of variance across groups (Levene's test), and independence. ANOVA is robust to moderate violations of normality when group sizes are equal and reasonably large (n > 20 per group).
Post-hoc tests: A significant ANOVA tells you that at least one group differs but does not identify which groups. Follow up with post-hoc pairwise comparisons: Tukey's HSD (controls for all pairwise comparisons), Bonferroni (conservative), or Games-Howell (when variances are unequal).
Variants: One-way ANOVA compares groups on a single factor. Two-way ANOVA examines two factors simultaneously and their interaction. Repeated measures ANOVA handles within-subjects designs where the same participants are measured multiple times. ANCOVA adds continuous covariates to control for confounding variables.
To determine required sample size for your ANOVA before data collection, run a power analysis specifying the number of groups, expected effect size, and desired power level.
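The omnibus-test-then-post-hoc workflow described above can be sketched with `scipy` (`tukey_hsd` requires scipy 1.8 or later; the dosage-group data are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# simulated outcome scores under three dosage groups
low = rng.normal(50, 8, 30)
mid = rng.normal(55, 8, 30)
high = rng.normal(62, 8, 30)

f, p = stats.f_oneway(low, mid, high)
print(f"F = {f:.2f}, p = {p:.4g}")

# follow a significant omnibus test with Tukey's HSD pairwise comparisons
if p < 0.05:
    print(stats.tukey_hsd(low, mid, high))
```

Running the three pairwise t-tests instead would triple the number of chances for a false positive, which is exactly what the omnibus F-test avoids.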
Kruskal-Wallis Test
The Kruskal-Wallis test is the non-parametric equivalent of one-way ANOVA. It compares the distributions of three or more independent groups using ranks.
When to use: Your outcome is ordinal, your continuous data are non-normal across groups, or sample sizes are too small to rely on ANOVA's robustness to non-normality. Common applications include comparing satisfaction ratings across three treatment protocols or comparing pain levels across four dosage groups.
Follow-up: A significant Kruskal-Wallis test indicates that at least one group differs. Use Dunn's test with Bonferroni correction for pairwise comparisons to identify which specific groups differ.
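A minimal Kruskal-Wallis sketch on simulated right-skewed data, assuming `scipy`. Note that Dunn's post-hoc test is not in scipy; it is available in the third-party `scikit-posthocs` package.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# right-skewed outcomes (e.g., recovery times) across three protocols
protocol_a = rng.exponential(1.0, 30)
protocol_b = rng.exponential(2.0, 30)
protocol_c = rng.exponential(4.0, 30)

h, p = stats.kruskal(protocol_a, protocol_b, protocol_c)
print(f"H = {h:.2f}, p = {p:.4g}")
# a significant result warrants Dunn's test with Bonferroni correction
# for pairwise follow-up (scikit-posthocs: posthoc_dunn)
```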
Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test is the non-parametric alternative to the paired t-test. It compares two related measurements, typically before and after an intervention on the same subjects, without assuming normality.
When to use: You have paired or matched data (the same subjects measured at two time points, or matched case-control pairs) and the difference scores are not normally distributed, or you have an ordinal outcome measured at two time points.
Interpretation: The Wilcoxon test evaluates whether the median difference between paired observations is significantly different from zero. Report the W statistic (or T statistic, depending on software), the p-value, and the matched-pairs rank-biserial correlation as the effect size.
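A minimal Wilcoxon signed-rank sketch on simulated paired data with skewed difference scores, assuming `scipy`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(7.0, 1.5, 20)        # pain score pre-treatment
improvement = rng.exponential(1.5, 20)   # skewed, non-normal change
after = before - improvement             # same subjects, post-treatment

# paired, non-parametric test on the difference scores
w, p = stats.wilcoxon(before, after)
print(f"W = {w}, p = {p:.4g}")
```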
The table below summarizes all comparison tests and their selection criteria.
| Scenario | Parametric Test | Non-Parametric Alternative | Key Assumption Check |
|---|---|---|---|
| 2 independent groups, continuous | Independent t-test | Mann-Whitney U | Normality + equal variance |
| 2 related groups, continuous | Paired t-test | Wilcoxon signed-rank | Normality of differences |
| 3+ independent groups, continuous | One-way ANOVA | Kruskal-Wallis | Normality + homogeneity |
| 3+ related groups, continuous | Repeated measures ANOVA | Friedman test | Sphericity (Mauchly's test) |
Statistical Tests for Relationships and Prediction
When your research question asks about association or prediction rather than group differences, you need correlation or regression methods. Correlation measures association strength between two variables, while regression models the predictive relationship and can control for confounding variables.
Pearson's Correlation Coefficient
Pearson's r measures the strength and direction of the linear relationship between two continuous variables. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.
Assumptions: Both variables are continuous, the relationship is linear (check with a scatterplot), both variables are approximately normally distributed, and there are no extreme outliers that could inflate or deflate the correlation.
When to use: You want to quantify the linear association between two continuous variables, for example, the relationship between study hours and exam scores, or between age and reaction time.
Interpretation: Report both the correlation coefficient (r) and the p-value. A significant p-value tells you the correlation is unlikely to be zero; the magnitude of r tells you the strength. By convention, r = 0.1-0.3 is small, 0.3-0.5 is medium, and above 0.5 is large. Always check the scatterplot; Pearson's r can miss non-linear relationships entirely.
Spearman's Rank Correlation
Spearman's rho is the non-parametric equivalent of Pearson's correlation. It measures the strength and direction of the monotonic (consistently increasing or decreasing) relationship between two variables by converting values to ranks before computing the correlation.
When to use: One or both variables are ordinal, the relationship is monotonic but not linear, data are non-normal, or outliers are present. Spearman's is also appropriate when you suspect a ceiling or floor effect that compresses the distribution of one variable.
Interpretation: Spearman's rho has the same range and general interpretation as Pearson's r, but it captures any monotonic relationship, not just linear ones. If Pearson's r and Spearman's rho give very different values, that discrepancy suggests a non-linear relationship or influential outliers.
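The Pearson-Spearman discrepancy described above is easy to demonstrate. The sketch below, assuming `scipy`, uses a strictly monotonic but exponential relationship: Spearman's rho is 1 because the ranks agree perfectly, while Pearson's r is noticeably lower because the relationship is not linear.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = np.exp(x / 2)   # strictly monotonic but strongly non-linear

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)
print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {r_spearman:.2f}")
```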
Linear Regression
Linear Regression models the relationship between a continuous outcome variable and one or more predictor variables. Unlike correlation, regression allows you to quantify how much the outcome changes for a one-unit change in the predictor while controlling for other variables.
Simple linear regression uses a single predictor. Multiple linear regression includes two or more predictors, enabling you to control for confounders and assess the independent contribution of each variable.
Assumptions: Linear relationship between predictors and outcome, normally distributed residuals (not raw data, a common misconception), homoscedasticity (constant variance of residuals), independence of observations, and no multicollinearity among predictors. Check these with residual plots, the Durbin-Watson statistic, and variance inflation factors (VIF).
When to use: You want to predict a continuous outcome from one or more predictors and quantify the strength and direction of each predictor's contribution. Examples include predicting hospital length of stay from patient age, comorbidity count, and surgery type.
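A minimal simple-linear-regression sketch on simulated (hypothetical) hospital data, assuming `scipy`; multiple regression with several predictors would require a fuller modeling library such as statsmodels:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
age = rng.uniform(30, 80, 120)
# hypothetical: length of stay rises ~0.08 days per year of age, plus noise
length_of_stay = 2 + 0.08 * age + rng.normal(0, 1.5, 120)

res = stats.linregress(age, length_of_stay)
print(f"slope = {res.slope:.3f} days/year, "
      f"r^2 = {res.rvalue**2:.2f}, p = {res.pvalue:.3g}")
```

The slope is the regression payoff correlation cannot give you: an interpretable "change in outcome per unit change in predictor."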
Logistic Regression
Logistic Regression models the probability of a binary outcome (yes/no, success/failure, disease/no disease) as a function of one or more predictor variables. The output is an odds ratio for each predictor, expressing how much the odds of the outcome change for a one-unit increase in the predictor.
When to use: Your outcome variable is binary or dichotomous. Examples include predicting whether a patient will be readmitted (yes/no) based on age, discharge diagnosis, and length of stay, or predicting treatment response (responder/non-responder) from baseline clinical characteristics.
Assumptions: Binary outcome, independent observations, linearity of logit (the log-odds are linearly related to continuous predictors), no multicollinearity, and adequate sample size (commonly cited rule of thumb: at least 10 events per predictor variable). Logistic regression does not assume normality or homoscedasticity.
Variants: Multinomial logistic regression handles outcomes with more than two categories. Ordinal logistic regression handles ordered categorical outcomes. Conditional logistic regression is used for matched case-control studies.
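In practice logistic regression is fitted with a dedicated library (statsmodels, scikit-learn, R's `glm`). To make the odds-ratio interpretation concrete, the sketch below instead fits the model by directly maximizing the log-likelihood with `scipy.optimize` on simulated readmission data; the variable names and the true coefficient (0.15 per year) are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n = 500
age = rng.normal(60, 10, n)
true_logit = -10 + 0.15 * age            # true log-odds slope: 0.15 per year
prob = 1 / (1 + np.exp(-true_logit))
readmitted = rng.binomial(1, prob)       # simulated binary outcome

X = np.column_stack([np.ones(n), age])   # intercept + predictor

def neg_log_lik(beta):
    z = X @ beta
    # negative log-likelihood; log(1 + e^z) computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, z) - readmitted * z)

fit = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
odds_ratio = np.exp(fit.x[1])            # odds ratio per year of age
print(f"estimated OR per year = {odds_ratio:.3f}")  # true value: e^0.15 ~ 1.16
```

Exponentiating a fitted coefficient is exactly how the odds ratios reported by standard software are produced.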
For researchers working with our statistical analysis service, regression modeling is one of the most frequently requested deliverables, particularly multivariable models that reviewers expect to see in observational studies.
Choosing Statistical Tests for Categorical Data
When both your outcome and your predictor are categorical variables, you need tests designed for frequency data rather than means or ranks. These tests compare observed frequencies against expected frequencies under the null hypothesis of no association.
Chi-Square Test of Independence
The chi-square test evaluates whether there is a statistically significant association between two categorical variables. It compares the observed frequencies in a contingency table to the frequencies you would expect if the variables were independent.
Assumptions: Observations are independent (each subject contributes to only one cell), the sample is drawn randomly, and expected cell frequencies are at least 5 in 80% of cells (with none below 1). The expected frequency requirement is critical: violating it inflates the Type I error rate.
When to use: You have two categorical variables and want to test whether they are associated. Examples include testing whether treatment group (drug vs. placebo) is associated with outcome (improved vs. not improved), or whether smoking status (current, former, never) is related to disease status (present vs. absent).
Interpretation: Report the chi-square statistic, degrees of freedom, and p-value. For effect size, report Cramer's V (ranges from 0 to 1, with 0 indicating no association). You can compute chi-square values using our free chi-square calculator.
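A minimal chi-square sketch on an illustrative 2x2 table, assuming `scipy` (note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default). Cramer's V is computed by hand from the chi-square statistic:

```python
import numpy as np
from scipy import stats

# illustrative 2x2 contingency table: rows = arm, cols = improved / not
table = np.array([[45, 15],    # drug
                  [30, 30]])   # placebo

chi2, p, dof, expected = stats.chi2_contingency(table)

# Cramer's V as an effect size: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}, V = {cramers_v:.2f}")
```

Checking `expected` before interpreting the result is exactly the assumption check described above: all expected counts should be at least 5.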
Fisher's Exact Test
Fisher's Exact Test is used when the chi-square test's assumptions about expected cell frequencies are violated: specifically, when any expected cell count falls below 5 or the total sample size is small.
When to use: You have a 2x2 (or small) contingency table with small expected frequencies. Fisher's test computes the exact probability of observing the data under the null hypothesis rather than relying on the chi-square approximation. It is commonly required in clinical studies with rare outcomes or pilot studies with small samples.
Interpretation: Fisher's test reports an exact p-value (no test statistic). For 2x2 tables, also report the odds ratio and its 95% confidence interval. Fisher's exact test is computationally intensive for large tables, which is why chi-square remains the default for larger samples.
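A minimal Fisher's exact test sketch on an illustrative small pilot-study table, assuming `scipy` (which returns the sample odds ratio ad/bc alongside the exact p-value):

```python
from scipy import stats

# illustrative small pilot study: rows = arm, cols = responded / did not
table = [[8, 2],
         [1, 9]]

odds_ratio, p = stats.fisher_exact(table)
print(f"OR = {odds_ratio:.1f}, exact p = {p:.4f}")
```

With expected counts this small, a chi-square approximation would be unreliable; the exact test sidesteps the approximation entirely.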
McNemar's Test
McNemar's test is the categorical equivalent of a paired test. It evaluates whether the distribution of a binary outcome changes between two related measurements, typically before and after an intervention on the same subjects.
When to use: You have paired binary data. Examples include testing whether the proportion of patients reporting pain (yes/no) changes from pre-treatment to post-treatment, or whether diagnostic agreement changes between two raters when using a new rating protocol versus the old one.
Interpretation: McNemar's test focuses on the discordant pairs: subjects who changed category between measurements. It tests whether the number changing in one direction significantly exceeds the number changing in the other direction. Report the McNemar chi-square statistic, the p-value, and the proportions at each time point.
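Dedicated implementations exist (e.g., `mcnemar` in statsmodels), but the exact version of McNemar's test reduces to a binomial test on the discordant pairs, which makes the logic transparent. A sketch with `scipy` on illustrative counts:

```python
from scipy import stats

# illustrative paired binary outcome: pain (yes/no) before vs. after
# b = yes -> no (improved), c = no -> yes (worsened);
# concordant pairs carry no information and drop out
b, c = 18, 5

# exact McNemar: under H0 the discordant pairs split 50/50,
# so the test is an exact binomial test on b out of b + c
result = stats.binomtest(b, n=b + c, p=0.5)
print(f"exact McNemar p = {result.pvalue:.4f}")
```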
| Categorical Data Scenario | Recommended Test | Key Requirement |
|---|---|---|
| 2 categorical variables, large sample | Chi-square test | Expected cell frequency >= 5 |
| 2 categorical variables, small sample | Fisher's exact test | Any expected cell < 5 |
| Paired binary outcome (before/after) | McNemar's test | Same subjects at 2 time points |
| Ordered categorical outcome, 2+ groups | Chi-square for trend | Ordinal outcome variable |
Checking Statistical Assumptions
Every parametric test carries assumptions about your data. Violating these assumptions does not necessarily invalidate the test; some tests are robust to moderate violations. Severe violations, however, can distort your p-values and confidence intervals. Checking assumptions before running your primary analysis is a non-negotiable step in rigorous research.
Testing for Normality
Normal Distribution is the most frequently checked assumption. The Shapiro-Wilk test is the recommended method for samples up to about 5,000 observations. It tests the null hypothesis that data are drawn from a normal distribution. If p < 0.05, you reject normality.
However, do not rely on the Shapiro-Wilk test alone. For large samples, even trivial departures from normality produce significant results. For small samples, the test may lack power to detect meaningful non-normality. Supplement the formal test with visual inspection: Q-Q plots (quantile-quantile plots) show whether data points follow the theoretical normal line, and histograms reveal skewness or multimodality.
What to do when normality is violated: For comparison tests, switch to the non-parametric alternative (Mann-Whitney instead of t-test, Kruskal-Wallis instead of ANOVA). For regression, check normality of residuals rather than raw data; non-normal predictors are acceptable as long as residuals are approximately normal. Data transformation (log, square root, reciprocal) can sometimes normalize skewed data, but report both transformed and untransformed results for transparency.
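A minimal normality-check sketch, assuming `scipy`, run on simulated right-skewed data where the Shapiro-Wilk test should reject:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
skewed = rng.exponential(scale=2.0, size=80)   # strongly right-skewed

w, p = stats.shapiro(skewed)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.2g}")  # p < 0.05: reject normality

# pair the formal test with a visual check, e.g. a Q-Q plot:
# import matplotlib.pyplot as plt
# stats.probplot(skewed, dist="norm", plot=plt); plt.show()
```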
Testing for Equal Variance
Levene's Test evaluates whether variances are equal across groups, an assumption required by the independent t-test and ANOVA. If Levene's test is significant (p < 0.05), the equal variance assumption is violated.
What to do when variance is unequal: For the t-test, use Welch's correction (most software offers this as an option; some make it the default). For ANOVA, use Welch's ANOVA or the Brown-Forsythe test, which do not assume equal variances. Alternatively, use the non-parametric equivalent, which is distribution-free and therefore unaffected by variance differences.
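The Levene-then-Welch workflow can be sketched with `scipy` on simulated groups whose standard deviations deliberately differ threefold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
group_1 = rng.normal(100, 5, 40)    # tight spread
group_2 = rng.normal(104, 15, 40)   # three times the standard deviation

stat, p_levene = stats.levene(group_1, group_2)
print(f"Levene: p = {p_levene:.4g}")

# unequal variances -> fall back to Welch's t-test
t, p = stats.ttest_ind(group_1, group_2, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.4f}")
```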
Independence
The assumption of independence requires that each observation is unrelated to every other observation. Violations occur in clustered data (students within classrooms, patients within hospitals), longitudinal data (repeated measurements on the same subjects), and matched designs.
What to do when independence is violated: Use methods designed for non-independent data. Paired or repeated measures designs require paired tests (paired t-test, repeated measures ANOVA, Wilcoxon, Friedman). Clustered data require multilevel (mixed-effects) models or generalized estimating equations (GEE). Ignoring non-independence produces p-values that are too small, inflating the false positive rate, sometimes dramatically.
Assumption Checking Summary
| Assumption | Test Method | Visual Check | If Violated |
|---|---|---|---|
| Normality | Shapiro-Wilk (p > 0.05 = normal) | Q-Q plot, histogram | Non-parametric test or transform |
| Equal variance | Levene's test (p > 0.05 = equal) | Boxplots by group | Welch's correction or non-parametric |
| Independence | Study design review | Residual plots (autocorrelation) | Paired/repeated measures/mixed models |
| Linearity | Residual plots | Scatterplot with LOESS | Non-linear regression or transformation |
| Homoscedasticity | Breusch-Pagan test | Residuals vs. fitted plot | Robust standard errors or WLS |
Common Statistical Test Selection Mistakes
Even experienced researchers make avoidable errors in statistical test selection. Recognizing these patterns helps you avoid them in your own work and catch them during peer review of others' manuscripts.
Mistake 1: Running multiple t-tests instead of ANOVA. When comparing three or more groups, each pairwise t-test uses a 5% significance threshold. With k groups, you run k(k-1)/2 comparisons. Five groups produce 10 t-tests, and the probability of at least one false positive climbs to approximately 40%. ANOVA with appropriate post-hoc testing maintains the overall error rate at 5%. This is the single most common statistical error in published research.
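The family-wise error inflation quoted above follows directly from the formula 1 - (1 - alpha)^k, as this one-liner shows:

```python
alpha = 0.05
for k in (1, 3, 10):   # number of pairwise t-tests
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} comparisons -> family-wise error rate = {fwer:.1%}")
```

For 10 comparisons the loop reproduces the roughly 40% figure cited above; a single ANOVA keeps the whole family at 5%.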
Mistake 2: Using parametric tests on ordinal data. Likert-scale responses (1-5 or 1-7) are ordinal: the distance between "agree" and "strongly agree" is not necessarily equal to the distance between "neutral" and "agree." Computing means and running t-tests on ordinal data is a common practice but is technically inappropriate. Use Mann-Whitney, Kruskal-Wallis, or ordinal regression for Likert-type outcomes.
Mistake 3: Ignoring paired/repeated measures structure. When the same subjects are measured at two time points, observations are not independent. Using an independent samples t-test instead of a paired t-test ignores the within-subject correlation, typically producing a larger standard error and a less powerful test. The paired design is almost always more powerful because it controls for inter-subject variability.
Mistake 4: Checking normality on the wrong thing. For regression, the normality assumption applies to the residuals, not to the raw predictor or outcome variables. A skewed outcome variable can produce perfectly normal residuals if the model is specified correctly. Check residuals after fitting the model, not the raw data before fitting.
Mistake 5: Using correlation to imply causation or prediction. Pearson's r and Spearman's rho measure the strength of association; they do not establish directionality, causation, or predictive utility. If your research question involves prediction or controlling for confounders, use regression. Correlation is exploratory; regression is explanatory and predictive.
Mistake 6: Applying chi-square to small samples. The chi-square test relies on a large-sample approximation. When expected cell frequencies fall below 5, the approximation breaks down and the p-value becomes unreliable. Use Fisher's Exact Test for small-sample categorical analyses. Many software packages report both by default, always check the expected frequencies before interpreting the chi-square result.
Mistake 7: Failing to run a power analysis before the study. Power Analysis determines the required sample size to detect a meaningful effect with adequate probability. Running a power analysis after data collection (post-hoc power) is widely criticized as uninformative. Plan your sample size before recruiting participants by specifying the expected effect size, desired power (typically 0.80 or higher), and significance level. Use our power analysis calculator to determine required sample size for your study design.
Mistake 8: Selecting a test based on significance. Some researchers run multiple tests and report whichever produces a significant result. This is a form of p-hacking that inflates the false positive rate. Choose your test before looking at the data, document it in your analysis plan, and report the result regardless of statistical significance. Pre-registration of analysis plans is increasingly expected by journals and funding agencies.
Choosing the correct statistical test is not about memorizing a flowchart; it is about understanding what your data look like, what question you are asking, and what assumptions each test requires. The four-question framework outlined in this guide (research question type, data type, number of groups, assumption status) will lead you to the right test for any standard analysis scenario. For complex designs involving multilevel data, time-to-event outcomes, or Bayesian approaches, consider consulting a biostatistician who can tailor the analytical strategy to your specific study. For guidance on when professional support adds the most value, see when to hire a biostatistician.