Can a measure be reliable but not valid?

Yes. A scale that always reads three kilograms too high is perfectly reliable but invalid. Reliability concerns random error and validity concerns systematic bias, so consistency alone does not ensure accuracy.

What is a good Cronbach's alpha value?

Values from roughly 0.70 to 0.95 are generally considered acceptable for internal consistency. Very low values suggest the items do not measure one construct, while values near or above 0.95 can indicate redundant items.

What are the main types of validity?

The principal types are content validity, construct validity (including convergent and discriminant evidence), and criterion validity (concurrent and predictive). Face validity, whether an instrument looks reasonable, is the weakest and is never sufficient alone.

How do you validate a questionnaire?

Start with expert review for content validity, pilot the instrument to refine items, then collect data to assess internal consistency, examine the factor structure for construct validity, and test convergent, discriminant, and criterion validity. Standards such as COSMIN specify what to report.

Reliability and Validity in Research

Q: What is the difference between reliability and validity?

Reliability is the consistency of a measurement, whether it gives the same result under the same conditions. Validity is its accuracy, whether it measures what it claims to. A measure can be consistent yet systematically wrong, so reliability does not guarantee validity.

Reliability is the consistency of a measurement, whether it gives the same answer under the same conditions, and validity is its accuracy, whether it measures what it claims to measure. The two are distinct and not interchangeable: a bathroom scale that reads three kilograms heavy every time is perfectly reliable and completely invalid. Any study that relies on a questionnaire, a scale, or a rating has to establish both, because an instrument that fails on either count cannot support a defensible conclusion, no matter how sophisticated the later analysis.

Why a measure can be reliable without being valid

This is the idea that anchors everything else. Reliability concerns random error: a noisy instrument scatters its readings. Validity concerns systematic error, or bias: a biased instrument is consistently wrong in the same direction. You can have consistency without accuracy, as the heavy scale shows, but you cannot have accuracy without consistency, because an instrument that gives different answers each time cannot be reliably hitting the truth. Reliability is therefore a necessary but not sufficient condition for validity. Establishing reliability first, then validity, is the logical order for validating any instrument you build a study around.
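The split between random and systematic error can be made concrete with a small simulation. The sketch below is illustrative only: the true weight, the three-kilogram bias, and both noise levels are assumed values chosen to mirror the heavy-scale example, not data from any study.

```r
# Simulate two hypothetical bathroom scales weighing the same 70 kg object.
set.seed(42)
true_weight <- 70
n <- 1000

# Reliable but invalid: tiny random error, constant +3 kg bias.
biased_scale <- true_weight + 3 + rnorm(n, mean = 0, sd = 0.05)

# Unbiased but unreliable: no systematic bias, large random error.
noisy_scale <- true_weight + rnorm(n, mean = 0, sd = 2)

summarise_scale <- function(readings, truth) {
  c(bias = mean(readings) - truth,   # systematic error (a validity problem)
    spread = sd(readings))           # random error (a reliability problem)
}

results <- rbind(
  biased = summarise_scale(biased_scale, true_weight),
  noisy  = summarise_scale(noisy_scale, true_weight)
)
print(round(results, 2))
```

The biased scale shows a bias close to 3 with almost no spread, while the noisy scale shows a bias near zero with a large spread. Averaging many readings can rescue the noisy scale but never the biased one.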

The main types of reliability

Reliability is assessed in several complementary ways, and which ones you need depends on the instrument:

Internal consistency asks whether the items on a multi-item scale measure the same underlying construct. It is the most commonly reported form, usually summarized by Cronbach's alpha, where values from roughly 0.70 to 0.95 are typically considered acceptable. You can compute it directly with our Cronbach's alpha calculator.
Test-retest reliability asks whether the same people score consistently when measured on two occasions, capturing stability over time.
Inter-rater reliability asks whether different raters assign consistent scores to the same cases, which matters whenever judgment is involved. For categorical ratings this is usually quantified with kappa, and for continuous ratings with the intraclass correlation; our guide to inter-rater reliability covers the categorical case in depth.
Parallel-forms reliability asks whether two equivalent versions of an instrument produce consistent results.
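Two of these forms reduce to short calculations. The sketch below computes Cronbach's alpha from its textbook definition and a test-retest correlation, using base R only. The function name `cronbach_alpha` and the simulated respondents are assumptions made for illustration; in practice a maintained package function is the safer choice.

```r
# Cronbach's alpha from its definition:
# alpha = k / (k - 1) * (1 - sum(item variances) / variance of total score)
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Hypothetical data: 200 respondents, one latent trait, four noisy items.
set.seed(1)
n <- 200
trait <- rnorm(n)
items <- sapply(1:4, function(j) trait + rnorm(n, sd = 1))
colnames(items) <- paste0("item", 1:4)

alpha_hat <- cronbach_alpha(items)

# Test-retest reliability: correlate scores from two occasions.
time1 <- trait + rnorm(n, sd = 0.5)
time2 <- trait + rnorm(n, sd = 0.5)
retest_r <- cor(time1, time2)

print(round(c(alpha = alpha_hat, test_retest = retest_r), 2))
```

Both statistics answer a consistency question, but about different sources of inconsistency: alpha across items at one time point, the retest correlation across occasions.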

Reporting the form of reliability that matches your instrument, rather than defaulting to Cronbach's alpha for everything, is a mark of a careful measurement section.

The main types of validity

Validity is the more demanding property and comes in layers:

Content validity asks whether the items adequately cover the full concept, usually judged by experts. A depression scale that omits sleep and appetite has a content gap.
Construct validity asks whether the instrument truly captures the abstract construct it targets, evaluated through convergent validity (it correlates with measures it should) and discriminant validity (it does not correlate with measures it should not).
Criterion validity asks whether scores relate to an external benchmark, either at the same time (concurrent validity) or in predicting a future outcome (predictive validity).
Face validity asks whether the instrument looks reasonable on the surface. It is the weakest form and never sufficient on its own.
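Convergent and discriminant evidence usually comes down to a pattern of correlations. The sketch below simulates that pattern with base R; the variable names and noise levels are hypothetical, and the latent construct is only observable here because the data are simulated.

```r
# Hypothetical data: a new scale, an established measure of the same
# construct, and a measure of an unrelated construct.
set.seed(7)
n <- 500
construct   <- rnorm(n)                       # latent trait (unobservable in practice)
new_scale   <- construct + rnorm(n, sd = 0.7)
established <- construct + rnorm(n, sd = 0.7)
unrelated   <- rnorm(n)                       # independent of the construct

convergent_r   <- cor(new_scale, established)  # should be substantial
discriminant_r <- cor(new_scale, unrelated)    # should be near zero

print(round(c(convergent = convergent_r, discriminant = discriminant_r), 2))
```

The evidence lies in the contrast between the two correlations, not in either value on its own. A scale that correlates strongly with everything has not shown that it measures anything specific.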

Construct validity is where serious instrument development concentrates, and it is frequently examined with factor analysis to confirm that items group onto the dimensions the theory predicts.

How to validate a questionnaire in practice

Validating a new questionnaire is a structured sequence, not a single test. You begin with content validity by having experts review the item pool against the construct. You pilot the instrument to refine wording and check that respondents interpret items as intended. You then collect data and assess internal consistency, examine the factor structure to confirm the dimensions, and test convergent and discriminant validity against related and unrelated measures. Standards such as the COSMIN guidance lay out exactly which measurement properties to report and how, and editors increasingly expect that level of documentation. The right sample size for this work depends on the number of items and factors, which is the kind of planning question our guidance on choosing an appropriate analysis helps you think through before data collection.

Validity and reliability in qualitative research

The vocabulary shifts but the concern does not. Qualitative researchers speak of trustworthiness, built from credibility, transferability, dependability, and confirmability, rather than reliability and validity in the psychometric sense. The underlying question is the same: can a reader trust that the findings reflect the data rather than the researcher's preconceptions? That is why an audit trail and reflexivity carry the weight in qualitative work that internal consistency and construct validity carry in quantitative measurement.

What reliability means quantitatively, and the error it hides

Behind the word consistency sits a precise model. Classical test theory writes every observed score as a true score plus random error, and defines reliability as the proportion of the observed variance that is true-score variance:

observed = true + error
reliability = var(true) / var(observed)

The number most worth reporting is not the reliability coefficient itself but its consequence for an individual score, the standard error of measurement, which is the standard deviation of the observed scores multiplied by the square root of one minus the reliability. It tells you how wide a confidence band to put around any single person's score, and a scale with an impressive alpha can still have a standard error too large for individual decisions. Unreliability also attenuates every correlation you compute with the scale: the observed correlation is the true correlation multiplied by the square root of the product of the two measures' reliabilities, so a true association is dragged toward zero as measurement noise grows. That is why an unreliable instrument quietly drains statistical power and can make a real effect look null.
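A worked example makes both consequences tangible. The sketch below uses assumed inputs throughout (a standard deviation of 10, a reliability of 0.91, an observed score of 50, and a true correlation of 0.50) and applies Spearman's attenuation formula, under which the observed correlation is the true correlation times the square root of the product of the two reliabilities.

```r
# Standard error of measurement (SEM) for a hypothetical scale.
sd_observed <- 10      # assumed SD of observed scores
reliability <- 0.91    # assumed reliability coefficient

sem <- sd_observed * sqrt(1 - reliability)

# Approximate 95% band around one person's observed score of 50.
observed_score <- 50
band <- observed_score + c(-1, 1) * qnorm(0.975) * sem

# Attenuation: expected observed correlation given the reliabilities
# of the two measures (Spearman's formula).
attenuate <- function(r_true, rel_x, rel_y) r_true * sqrt(rel_x * rel_y)
r_observed <- attenuate(r_true = 0.50, rel_x = 0.80, rel_y = 0.70)

print(round(c(sem = sem, lower = band[1], upper = band[2],
              r_observed = r_observed), 2))
```

With these inputs the SEM is 3 points, so the band runs from about 44.1 to 55.9 even though the reliability looks excellent, and a true correlation of 0.50 shrinks to roughly 0.37 once both measures carry realistic amounts of noise.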

Cronbach's alpha is weaker than its reputation

Alpha is reported almost reflexively, and almost as often misunderstood. It is a lower bound on reliability that equals the true value only under tau-equivalence, the unrealistic assumption that every item loads equally on the construct; when items differ in quality, alpha understates reliability. It is not a measure of unidimensionality: a multidimensional scale can post a high alpha. It also rises mechanically with the number of items, so a long mediocre scale can look reliable. Two refinements matter. Prefer McDonald's omega, computed from a factor model, which drops the equal-loading assumption and is the coefficient most measurement methodologists now recommend. And for ordered Likert items, compute ordinal alpha or ordinal omega from polychoric correlations rather than the standard versions, which are attenuated for coarse scales. One more reading is counterintuitive: an alpha above roughly 0.95 is usually a warning, not a triumph, signalling redundant items that ask the same question in slightly different words.
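The dependence on scale length is easy to see from the standardized form of alpha, which depends only on the number of items and the mean inter-item correlation. The sketch below holds that correlation at an assumed, deliberately mediocre 0.20 and varies only the length; the function name `standardized_alpha` is chosen here for illustration.

```r
# Standardized alpha as a function of the number of items (k) and the
# mean inter-item correlation (r_bar).
standardized_alpha <- function(k, r_bar) {
  (k * r_bar) / (1 + (k - 1) * r_bar)
}

k_values <- c(5, 10, 20, 40)
alpha_by_length <- standardized_alpha(k_values, r_bar = 0.20)
names(alpha_by_length) <- paste0("k=", k_values)

print(round(alpha_by_length, 2))
```

Item quality never changes in this example, yet alpha climbs from about 0.56 with five items to about 0.91 with forty. A high alpha on a long scale therefore says little about how well any individual item works.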

Generalizability theory: reliability across several sources of error at once

Classical test theory bundles all error into a single term, but real measurements vary across several facets at once (different raters, different items, different occasions), and you often want to know how much each contributes. Generalizability theory extends reliability by using analysis of variance to partition the error into its sources; a decision study then estimates how reliability would change if you added raters or items. It answers the design question classical reliability cannot: given a fixed budget, is it better to recruit more raters or to write more items? For any instrument where multiple sources of inconsistency operate together, this is the rigorous framework.
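The budget question can be sketched as a small decision study. The example below assumes a fully crossed persons by raters by items design and uses made-up variance components; in a real analysis they would come from the ANOVA step of a generalizability study. It computes the relative generalizability coefficient, which divides each error component by the number of conditions averaged over.

```r
# Hypothetical variance components from a persons x raters x items G-study.
vc <- c(person = 0.50,        # true differences between people (signal)
        person_rater = 0.20,  # raters rank people differently
        person_item = 0.05,   # items rank people differently
        residual = 0.25)      # three-way interaction plus unexplained error

# Relative generalizability coefficient for a design with n_raters and n_items.
g_coefficient <- function(n_raters, n_items, vc) {
  relative_error <- vc[["person_rater"]] / n_raters +
    vc[["person_item"]] / n_items +
    vc[["residual"]] / (n_raters * n_items)
  vc[["person"]] / (vc[["person"]] + relative_error)
}

# Decision study: two ways to spend a budget of 20 ratings per person.
designs <- data.frame(n_raters = c(2, 4), n_items = c(10, 5))
designs$g <- mapply(g_coefficient, designs$n_raters, designs$n_items,
                    MoreArgs = list(vc = vc))
print(designs)
```

Under these assumed components, four raters with five items (about 0.87) beats two raters with ten items (about 0.81), because rater disagreement is the larger source of error. With different components the answer can reverse, which is why the variance partition has to come first.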

Estimating these in R

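library(psych)

# Assumed inputs, not defined in this snippet:
#   items   - data frame of item responses, one column per item
#   score   - numeric vector of total scale scores
#   rel     - reliability estimate for the scale (for example, omega total)
#   ratings - data frame of ratings, one column per rater

# Internal consistency done properly: omega (and ordinal omega for Likert items)
omega(items)                          # McDonald's omega from a factor model
psych::alpha(items)                   # Cronbach's alpha, for comparison only (namespaced: ggplot2 also exports alpha)
omega(polychoric(items)$rho)          # ordinal omega for coarse Likert scales

# Standard error of measurement around a single score
sem <- sd(score) * sqrt(1 - rel)

# Inter-rater agreement
library(irr)
kappa2(ratings[, 1:2], weight = 'squared')   # weighted kappa, two raters, ordinal
icc(ratings, model = 'twoway', type = 'agreement', unit = 'single')  # continuous ratings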

Why this is worth getting right

Measurement sits upstream of every result. If the instrument is unreliable, your statistical power drains away into noise; if it is invalid, your conclusions are precise statements about the wrong thing. Reviewers know this and scrutinize the measurement section accordingly. When a study depends on a custom scale or a translated instrument, having the validation designed and analyzed properly, including the factor structure and the reliability statistics, protects every finding that rests on it.

Reliability and Validity: What They Mean and How to Test Them

Key Takeaways

Establish reliability before validity

Match the reliability statistic to the instrument

Treat face validity as a starting point only

Report omega and the standard error of measurement, not just alpha

Use the intraclass correlation, not kappa, for continuous ratings

Dr. Sarah Mitchell
