Compare the means of two independent groups using either Welch's test (recommended) or Student's pooled-variance test. Enter summary statistics or paste raw data columns and get the t statistic, degrees of freedom, p-value, confidence interval for the mean difference, Cohen's d, Hedges' g, and a copy-paste APA write-up.
Enter the mean, standard deviation, and sample size for each group.
Group 1
Group 2
Choose summary statistics if you only have M, SD, and n; choose raw data if you have the original values to paste.
Welch is the default and remains accurate even when variances are unequal. Pick Student only when an equal-variance assumption is justified.
Two-tailed for a non-directional comparison, one-tailed for a pre-registered directional hypothesis. CI level controls the width of the interval reported.
Enter means, SDs, and ns in summary mode, or paste columns of numbers in raw mode. Results update live.
Inspect t, df, p, and the CI for the mean difference. Copy R or Python code for reproducibility, or paste the APA write-up into your draft.
Want a PhD methodologist to handle the whole project?
Get a complete systematic review or meta-analysis handled end-to-end. From $750 · Quote in under 1 hour · Pay only after you approve scope.
Reporting only t and p hides the size and precision of the effect. The 95 percent CI for the mean difference is the most informative single statistic you can report, and many journals now require it.
Welch's t-test is robust to unequal variances, does not assume equal sample sizes, and loses almost no power when the variances happen to be equal. Choose Student only when an equal-variance assumption is defensible.
A statistically significant t-test with a tiny Cohen's d may be a sample-size artefact rather than a meaningful effect. Always report d (or g for small samples) so readers can judge practical importance.
Pre-specify the alternative hypothesis, alpha, and effect size of interest before running the test. Adjusting these post hoc invalidates the inference and inflates false positives.
The two-sample t-test answers a question that appears in nearly every empirical study: do two independent groups have the same population mean on some continuous outcome? Gosset (1908), publishing as "Student" because his employer, the Guinness brewery, barred staff from publishing under their own names, first derived the small-sample distribution that bears his pseudonym. The pooled-variance version of the test assumes the two populations share a common variance, in which case both samples can be combined to estimate it. Welch (1947) extended the test to the more realistic case where the two populations have different variances, using the Satterthwaite approximation to the degrees of freedom that this calculator implements by default.
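As a minimal sketch of what the default option computes, the Welch statistic and Satterthwaite degrees of freedom take only a few lines of Python; the summary numbers below are illustrative, and SciPy is assumed for the t distribution:

```python
import math
from scipy import stats

def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Welch's t with Satterthwaite degrees of freedom and two-tailed p."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2          # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    p = 2 * stats.t.sf(abs(t), df)             # two-tailed p-value
    return t, df, p

# Illustrative summary statistics, not real data
t, df, p = welch_t(10.2, 2.1, 30, 9.1, 3.4, 25)
print(f"t = {t:.2f}, df = {df:.1f}, p = {p:.4f}")
```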
Modern simulation studies overwhelmingly recommend Welch's variant as the routine choice. Ruxton (2006) showed that Welch achieves nominal Type I error rates across a wide range of variance ratios and sample-size imbalances, while Student's pooled test inflates false positives whenever the larger sample also has the larger variance. Delacre, Lakens, and Leys (2017) reach the same conclusion and explicitly recommend that researchers stop running preliminary equal-variance tests such as Levene's and simply use Welch unconditionally. The calculator therefore defaults to Welch and treats Student's pooled test as a less common option for situations where a specific equal-variance assumption is justified.
The output you should report depends on the outlet. The American Psychological Association manual and most clinical journals expect t, df, exact p, and the confidence interval for the mean difference, plus an effect size. Cohen's d in pooled standard-deviation units is the most widely reported effect size for two-group designs; Hedges (1981) showed that d has a small upward bias for sample sizes below about 30 per group, which is corrected by multiplying by a factor that depends only on the total degrees of freedom. The result is Hedges' g, which the calculator reports alongside d so you can pick the more appropriate one for your sample size.
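For readers who want the arithmetic, here is a small Python sketch of pooled-SD Cohen's d and the approximate Hedges correction J = 1 - 3/(4·df - 1); the group summaries are made up for illustration:

```python
import math

def cohens_d_hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Pooled-SD Cohen's d and its small-sample correction, Hedges' g."""
    df = n1 + n2 - 2
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / sd_pooled
    j = 1 - 3 / (4 * df - 1)   # Hedges' approximate bias-correction factor
    return d, j * d

# Illustrative small-sample summaries, not real data
d, g = cohens_d_hedges_g(10.2, 2.1, 12, 9.1, 3.4, 11)
print(f"d = {d:.3f}, g = {g:.3f}")
```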
Two assumptions deserve attention. First, the test assumes the outcome is approximately normally distributed within each group; the central limit theorem makes this assumption forgiving for sample sizes above roughly 30 per group, but at smaller samples a heavy tail or a strong skew can distort both the p-value and the CI. Second, the test assumes the observations are independent; clustered, paired, or longitudinal data violate this assumption and require a different model such as a paired t-test, a mixed-effects model, or a generalised estimating equation. When in doubt, our biostatistics service audits the assumptions and switches to robust or non-parametric alternatives such as the Mann-Whitney U test or a bootstrap CI when needed.
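When you have the raw values, one of the robust alternatives mentioned above, a percentile bootstrap CI for the mean difference, can be sketched with NumPy; the resample count and seed below are arbitrary illustrative choices, not a recommendation:

```python
import numpy as np

def bootstrap_mean_diff_ci(x, y, n_boot=10_000, level=0.95, seed=42):
    """Percentile bootstrap CI for the difference in group means."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    diffs = np.array([
        rng.choice(x, x.size).mean() - rng.choice(y, y.size).mean()
        for _ in range(n_boot)   # resample each group with replacement
    ])
    alpha = 1 - level
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))
```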
Once you have a t-test result, the next questions are usually about the broader study design. Was the sample size adequate? Use the sample size calculator to verify retrospectively or plan a follow-up. Are you reporting a paper that aggregates several t-tests? Use the effect size calculator to harmonise effect sizes across studies, or feed them into the forest plot generator for a meta-analytic summary. For full-service support including assumption diagnostics, missing-data imputation, multiple-comparison adjustments, and APA-compliant manuscript text, our statistical analysis service delivers a PhD-led report and reproducible code with every project.
Use a two-sample t-test when you want to compare the means of one continuous outcome between two independent groups, such as a treatment arm versus a control arm in a parallel-group trial or men versus women in a survey. The two groups must be independent, meaning observations in one group are not paired with observations in the other. For paired or repeated measures designs, use a paired t-test instead.
Use Welch's t-test by default. It does not assume the two groups have equal variances, and when the assumption of equal variances does happen to hold, Welch's test loses almost no power compared with Student's pooled-variance test. The simulation work of Ruxton (2006) and Delacre, Lakens, and Leys (2017) recommends Welch unconditionally as the better routine choice. Use Student's pooled-variance test only when you have a strong prior reason to assume equal variances or you must match a textbook procedure.
The p-value is the probability of observing a difference between sample means as extreme as the one you found, or more extreme, if the true population means were actually equal. A small p-value (commonly below 0.05) suggests the observed difference would be unusual under the null hypothesis of no difference, leading you to reject the null. The p-value is not the probability that the null is true, and it is not the size of the effect.
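In code terms, the exact two-tailed p is just the tail area of the t distribution beyond the observed statistic; the t and df values here are illustrative:

```python
from scipy import stats

t_obs, df = 2.31, 41.4                        # illustrative Welch output
p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)  # area in both tails
```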
The interval gives the range of plausible values for the population mean difference. A 95 percent CI is built using a procedure that captures the true difference 95 times out of 100 across repeated samples. If the interval excludes zero, the difference is statistically significant at the 5 percent level, which mirrors the p-value decision. Always report the CI alongside the t statistic and p-value, because it shows both the size and the precision of the effect.
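A minimal sketch of the Welch interval, (m1 - m2) plus or minus t_crit times the standard error with Satterthwaite df, assuming SciPy for the t quantile:

```python
import math
from scipy import stats

def welch_mean_diff_ci(m1, sd1, n1, m2, sd2, n2, level=0.95):
    """Welch CI for the mean difference: diff +/- t_crit * SE."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    t_crit = stats.t.ppf((1 + level) / 2, df)  # e.g. 97.5th percentile
    diff = m1 - m2
    return diff - t_crit * se, diff + t_crit * se
```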
Effect sizes complement the p-value. Cohen's d expresses the mean difference in pooled standard-deviation units, with conventions of 0.2 (small), 0.5 (medium), and 0.8 (large). Hedges' g applies a small-sample correction factor to Cohen's d that removes upward bias when group sizes are below about 30. Reporting d and g lets readers compare your effect to the broader literature and lets your study contribute to a meta-analysis without recomputation.
Sample size depends on the effect size you expect to detect, the alpha level, and the desired power. As a rough anchor, detecting a Cohen's d of 0.5 with 80 percent power at alpha 0.05 (two-tailed) requires about 64 participants per group. Use our sample size calculator to plan precisely, and increase the target if you anticipate dropout, missing data, or non-normal outcomes.
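One way to reproduce that anchor figure is with statsmodels, one common power-analysis library (an assumption for illustration, not the tool this page uses internally):

```python
from statsmodels.stats.power import TTestIndPower

# Per-group n to detect d = 0.5 at alpha .05 (two-tailed) with 80% power
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5, alpha=0.05, power=0.80, alternative="two-sided"
)
print(round(n_per_group))   # about 64
```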
The t-test assumes the outcome is approximately normally distributed within each group, especially at small sample sizes. With sample sizes above roughly 30 per group, the central limit theorem makes the t-test robust to moderate non-normality. For heavily skewed outcomes or small samples, consider a Mann-Whitney U test or a bootstrap confidence interval. Our biostatistics service handles non-parametric and robust alternatives for any design.
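As a quick illustration, the rank-based alternative is a single SciPy call; the raw values below are invented:

```python
from scipy import stats

# Invented raw values for two skewed groups
x = [3.1, 4.7, 2.2, 9.8, 5.0, 12.4]
y = [1.4, 2.0, 3.3, 2.8, 1.9]
u, p = stats.mannwhitneyu(x, y, alternative="two-sided")
```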
A two-tailed test asks whether the two means differ in either direction; a one-tailed test asks whether the mean of group 1 is specifically greater than or specifically less than the mean of group 2. One-tailed tests have more power but require a directional hypothesis specified before looking at the data. Most journals expect two-tailed tests unless a directional prediction is registered in advance.
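If you run the test on raw data in SciPy, the direction is set through the alternative argument; the arrays below are illustrative:

```python
from scipy import stats

x = [10.1, 11.4, 9.8, 12.0, 10.7]   # illustrative raw data
y = [9.2, 8.8, 10.1, 9.5, 9.0]
# Pre-registered directional hypothesis: mean of x exceeds mean of y
res = stats.ttest_ind(x, y, equal_var=False, alternative="greater")
```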
No. This calculator runs an independent-samples test, where group 1 and group 2 are different participants. For paired data such as pre-post measurements on the same person or matched cases and controls, you need a paired t-test, which compares the within-subject differences against zero. Our statistical analysis service handles paired designs and the more general repeated-measures and mixed-effects models built around them.
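For reference, a paired analysis on raw pre-post data is a different SciPy call, which tests the mean within-subject difference against zero (illustrative data):

```python
from scipy import stats

pre  = [5.1, 6.0, 4.8, 7.2, 6.5]    # illustrative within-subject pairs
post = [5.9, 6.4, 5.5, 7.9, 6.8]
res = stats.ttest_rel(post, pre)    # tests mean(post - pre) against zero
```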
No. All inputs and outputs stay inside your browser tab. Nothing is sent to a server, so the calculator is safe to use with proprietary or confidential research data.
If you only have means and SDs and want the CI for the difference rather than a hypothesis test, the confidence interval calculator includes a Welch mean-difference tab. To plan the sample size needed for a future t-test, use the sample size calculator. To convert a published p-value back into a CI, see the p-value to confidence interval converter. If you have multiple predictors rather than a binary group, the linear regression calculator handles continuous and categorical covariates together. For meta-analytic effect-size harmonisation, the effect size calculator converts t-test output into Cohen's d, Hedges' g, or correlation r.
Reviewed by
Dr. Sarah Mitchell holds a PhD in Biostatistics from Johns Hopkins Bloomberg School of Public Health and has over 15 years of experience in systematic review methodology and meta-analysis. She has authored or co-authored 40+ peer-reviewed publications in journals including the Journal of Clinical Epidemiology, BMC Medical Research Methodology, and Research Synthesis Methods. A former Cochrane Review Group statistician and current editorial board member of Systematic Reviews, Dr. Mitchell has supervised 200+ evidence synthesis projects across clinical medicine, public health, and social sciences. She reviews all Research Gold tools to ensure statistical accuracy and compliance with Cochrane Handbook and PRISMA 2020 standards.
Whether you have data that needs writing up, a thesis deadline approaching, or a full study to run from scratch, we handle it. Average turnaround: 2-4 weeks.