QUADAS-2 is the standard tool for assessing risk of bias and applicability in primary diagnostic test accuracy studies included in systematic reviews. It is the second version of the Quality Assessment of Diagnostic Accuracy Studies tool, published in 2011 by Penny Whiting and colleagues on behalf of the QUADAS-2 Group, and it remains the tool endorsed by Cochrane and widely recommended by medical journals for diagnostic accuracy reviews.
The tool covers four risk-of-bias domains (patient selection, index test, reference standard, flow and timing) and separately assesses applicability concerns in the first three of those domains. Each domain is evaluated using signaling questions that guide the reviewer to a domain-level judgment of low, high, or unclear risk of bias. This guide unpacks the four domains, walks through the signaling questions, explains the customization step that QUADAS-2 expects every review to perform, and covers extensions such as QUADAS-C for comparative diagnostic accuracy.
The Four Domains and Why QUADAS-2 Replaced QUADAS-1
The original QUADAS tool, published in 2003 by Penny Whiting, Anne Rutjes, and colleagues, was the first widely adopted instrument for assessing diagnostic accuracy studies. It contained fourteen items scored as yes, no, or unclear. The tool quickly became standard in Cochrane diagnostic test accuracy reviews and was cited in thousands of systematic reviews over the following decade.
Practical experience with QUADAS-1 surfaced three recurring problems. First, the single composite score that some reviewers computed from the fourteen items was unreliable and not endorsed by the developers, but its use was widespread enough that interpretations diverged. Second, several items mixed risk of bias with quality of reporting, which conflated study conduct with study description and confused reviewers. Third, applicability concerns, the question of whether a study's design matches the review question, were not cleanly separated from risk of bias, leading to inconsistent ratings across reviews.
QUADAS-2 addressed all three problems with a redesign. The new tool has four risk-of-bias domains rather than fourteen items, separates risk of bias from applicability, uses signaling questions that guide rather than dictate the judgment, and explicitly recommends against any composite score. The structure is closer to what reviewers familiar with RoB 2 for randomized trials or ROBINS-I for non-randomized studies expect: domain-level judgments rather than item-level scores. For where QUADAS-2 sits in the broader landscape of risk-of-bias tools, see the quality assessment tools compared overview.
Domain 1: Patient Selection
The first domain asks whether the way patients were selected for the study could have introduced bias into the diagnostic accuracy estimates. Two signaling questions guide the judgment. The first asks whether a consecutive or random sample of patients was enrolled. Convenience sampling, especially when patients were selected based on prior clinical suspicion or prior test results, can inflate the apparent accuracy of the index test. The second asks whether the study avoided inappropriate exclusions, such as removing patients with difficult-to-classify disease, equivocal results, or comorbidities that complicate diagnosis.
A third signaling question asks whether the study avoided a case-control design. Case-control studies of diagnostic accuracy tend to overestimate sensitivity and specificity, often substantially, because they sample only clearly positive and clearly negative patients, omitting the diagnostic uncertainty that defines real-world clinical practice. The signaling question is phrased so that "yes" indicates lower bias risk: yes, the study avoided a case-control design.
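To make the mechanism concrete, here is a toy simulation of a hypothetical continuous index test under the two enrollment strategies. Every distribution, prevalence, and cutoff in it is an invented assumption rather than anything QUADAS-2 prescribes, but it reproduces the characteristic pattern: enrolling only clearly severe cases and pristine controls inflates both sensitivity and specificity relative to a consecutive series that spans the full clinical spectrum.

```python
# Toy simulation of spectrum bias; all numbers below are invented assumptions.
import random

random.seed(1)
CUTOFF = 1.5  # fixed positivity threshold for the hypothetical index test

def diseased_value(severity):
    return severity + random.gauss(0, 1)        # sicker patients score higher

def healthy_value(comorbid):
    # Comorbid controls partly mimic disease on this hypothetical test.
    return random.gauss(0.8 if comorbid else 0.0, 1)

def se_sp(sample):
    tp = sum(d and v >= CUTOFF for d, v in sample)
    fn = sum(d and v < CUTOFF for d, v in sample)
    tn = sum(not d and v < CUTOFF for d, v in sample)
    fp = sum(not d and v >= CUTOFF for d, v in sample)
    return tp / (tp + fn), tn / (tn + fp)

# Consecutive series: 30% prevalence, mild-to-severe disease, mixed controls.
consecutive = []
for _ in range(20000):
    if random.random() < 0.30:
        consecutive.append((True, diseased_value(random.uniform(0.5, 3.5))))
    else:
        consecutive.append((False, healthy_value(random.random() < 0.4)))

# Case-control: only clearly severe cases and pristine healthy controls.
case_control = (
    [(True, diseased_value(random.uniform(2.5, 3.5))) for _ in range(3000)]
    + [(False, healthy_value(False)) for _ in range(3000)]
)

print("consecutive  Se=%.2f Sp=%.2f" % se_sp(consecutive))
print("case-control Se=%.2f Sp=%.2f" % se_sp(case_control))
```

The same test, at the same threshold, looks markedly better in the case-control sample, which is exactly the distortion the signaling question is designed to catch.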
The applicability concern for patient selection asks a separate question: does the patient sample match the review question? A well-conducted study in a tertiary referral cohort may have low risk of bias but high applicability concern if the review question is about primary care. This separation is one of the most important conceptual contributions of QUADAS-2. It allows a reviewer to flag a study as well-conducted but not directly relevant, rather than penalizing it twice.
Domain 2: Index Test
The second domain covers the index test, which is the test whose accuracy the review is evaluating. The first signaling question asks whether the index test results were interpreted without knowledge of the results of the reference standard. Interpreting the index test with the reference result in hand introduces review bias, which can substantially inflate apparent accuracy. The blinding question is sometimes hard to answer from the published methods: many studies do not state explicitly whether the index test reader was blinded.
The second signaling question, applied when the index test has a threshold, asks whether the threshold was pre-specified. Post-hoc threshold optimization, where the study chose the threshold after seeing the data, inflates apparent accuracy because the chosen threshold maximizes performance in the study sample. Pre-specified thresholds, especially those defined in a separate derivation cohort or in published clinical guidelines, avoid this inflation. If the threshold was data-driven, the signaling question is answered "no" and the domain is typically rated at high risk of bias.
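The inflation from data-driven thresholds is easy to demonstrate. The sketch below rests on an invented data model; it picks the cutoff that maximizes Youden's index (sensitivity + specificity − 1) in a small study sample, then re-evaluates that same cutoff on fresh patients. The in-sample figures are reliably more flattering than the out-of-sample ones.

```python
# Minimal sketch of post-hoc threshold optimization; the data model is invented.
import random

random.seed(7)

def draw_study(n, prevalence=0.3):
    """One hypothetical study: (diseased, continuous test value) pairs."""
    return [(d, random.gauss(1.0 if d else 0.0, 1.0))
            for d in (random.random() < prevalence for _ in range(n))]

def youden(sample, cutoff):
    tp = sum(d and v >= cutoff for d, v in sample)
    fn = sum(d and v < cutoff for d, v in sample)
    tn = sum(not d and v < cutoff for d, v in sample)
    fp = sum(not d and v >= cutoff for d, v in sample)
    se, sp = tp / (tp + fn), tn / (tn + fp)
    return se + sp - 1, se, sp

study = draw_study(150)                      # a small accuracy study
candidates = [v for _, v in study]

# Post-hoc optimization: choose the cutoff maximizing Youden's J in-sample.
best_cut = max(candidates, key=lambda c: youden(study, c)[0])

_, se_in, sp_in = youden(study, best_cut)
_, se_out, sp_out = youden(draw_study(100000), best_cut)   # fresh patients

print("in-sample     Se=%.2f Sp=%.2f" % (se_in, sp_in))
print("out-of-sample Se=%.2f Sp=%.2f" % (se_out, sp_out))
```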
The applicability concern for the index test asks whether the test, as conducted in the study, matches the test the review is evaluating. Subtle differences (a different machine model, a different antibody, a different software version, a different reader experience level) can introduce applicability concerns even when the index test concept is the same. Reviewers should consult the diagnostic test accuracy meta-analysis framework for how to handle this kind of subtle test variation.
Domain 3: Reference Standard
The third domain covers the reference standard, which is the test or process used to determine the true disease status against which the index test is compared. The first signaling question asks whether the reference standard is likely to correctly classify the target condition. An imperfect reference standard introduces bias that propagates into the index test's estimated accuracy. A reviewer assessing a new biomarker against an established histopathology reference can usually answer this favorably; a reviewer assessing a new test against an older test with known accuracy limitations cannot.
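Under the simplifying assumption that the reference standard's errors are independent of the index test result (the nondifferential case; correlated errors behave differently), the distortion can be worked out arithmetically. The short calculation below uses invented numbers to show how a genuinely excellent index test can appear mediocre when scored against a reference that itself misses ten percent of cases.

```python
# Hedged arithmetic sketch: how an imperfect reference standard distorts the
# index test's apparent accuracy, assuming reference errors are independent
# of the index test result. All example numbers are invented.

def apparent_accuracy(prev, t_se, t_sp, r_se, r_sp):
    """Apparent Se/Sp of an index test (true Se/Sp: t_se, t_sp) when judged
    against an imperfect reference with sensitivity r_se and specificity r_sp."""
    # Patients the reference calls positive: true cases it catches plus
    # healthy patients it mislabels.
    ref_pos = prev * r_se + (1 - prev) * (1 - r_sp)
    ref_neg = 1 - ref_pos
    apparent_se = (prev * t_se * r_se
                   + (1 - prev) * (1 - t_sp) * (1 - r_sp)) / ref_pos
    apparent_sp = ((1 - prev) * t_sp * r_sp
                   + prev * (1 - t_se) * (1 - r_se)) / ref_neg
    return apparent_se, apparent_sp

se, sp = apparent_accuracy(prev=0.3, t_se=0.95, t_sp=0.95, r_se=0.90, r_sp=0.95)
print("apparent Se=%.2f Sp=%.2f (true Se=0.95, Sp=0.95)" % (se, sp))
```

With these numbers the apparent sensitivity drops to roughly 0.85 despite a true sensitivity of 0.95, purely because the reference standard misclassifies some patients.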
The second signaling question asks whether the reference standard results were interpreted without knowledge of the index test. The same blinding concern that applies to the index test applies in reverse to the reference standard. If the pathologist or radiologist reading the reference standard knew the index test result, their interpretation may have been influenced, which biases the diagnostic accuracy estimate in either direction depending on the nature of the influence.
The applicability concern for the reference standard asks whether the target condition as defined by the reference standard matches the target condition in the review question. A reference standard that defines disease more broadly or more narrowly than the review intends can produce accuracy estimates that look relevant but do not generalize. This is especially common with evolving diagnostic criteria, where reference standards in older studies may not match contemporary definitions.
Domain 4: Flow and Timing
The fourth domain covers the flow of patients through the study and the timing between the index test and the reference standard. Three signaling questions structure the judgment. The first asks whether there was an appropriate interval between the index test and the reference standard. Long intervals introduce the possibility that disease status changed between the two tests, which decouples the index test result from the reference standard truth and biases the accuracy estimates.
The second asks whether all patients received the same reference standard. Differential verification occurs when some patients receive one reference standard and others receive a different one, often based on the index test result. This introduces bias because the apparent accuracy depends on how the two reference standards classify cases differently. The third asks whether all patients were included in the analysis. Partial verification occurs when patients without a reference standard result are excluded from the analysis, and verification is again often correlated with the index test result. Both differential and partial verification require careful handling in the bias judgment.
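A small worked example makes partial verification bias concrete. The counts and verification rates below are invented, but the direction of the distortion, inflated sensitivity and deflated specificity when index-negative patients are verified less often, is the classic pattern.

```python
# Toy numeric example of partial verification bias; all counts and
# verification rates are invented assumptions for illustration.

# A hypothetical complete cohort of 1000 patients:
#   true Se = 160/200 = 0.80, true Sp = 720/800 = 0.90
tp, fn, fp, tn = 160, 40, 80, 720

# Suppose index-positive patients are sent for the reference standard 90%
# of the time, but index-negative patients only 30% of the time.
verify_pos, verify_neg = 0.90, 0.30

v_tp, v_fp = tp * verify_pos, fp * verify_pos   # index-positive, verified
v_fn, v_tn = fn * verify_neg, tn * verify_neg   # index-negative, verified

print("true     Se=%.2f Sp=%.2f" % (tp / (tp + fn), tn / (tn + fp)))
print("verified Se=%.2f Sp=%.2f" % (v_tp / (v_tp + v_fn), v_tn / (v_tn + v_fp)))
# Analyzing only verified patients inflates sensitivity (0.92 vs 0.80) and
# deflates specificity (0.75 vs 0.90) without any change to the test itself.
```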
Patient flow problems are among the most commonly missed sources of bias in diagnostic accuracy reviews. Reviewers focused on patient selection and index test blinding sometimes treat flow and timing as a checklist item rather than a substantive concern, but the bias introduced by inappropriate flow can be larger than the bias from interpretation issues. There is no separate applicability concern for flow and timing because flow is fundamentally about how the study was run rather than what the study was about.
Signaling Questions Versus Domain Judgments
A central design feature of QUADAS-2 is the distinction between signaling questions and domain judgments. Signaling questions are concrete, factual questions about study conduct (was the threshold pre-specified, were patients consecutively enrolled). Domain judgments are summary risk-of-bias ratings for the entire domain on a three-point scale: low, high, or unclear.
The signaling questions guide the reviewer toward the domain judgment but do not determine it mechanically. A reviewer assessing a study with three signaling questions answered yes and one answered unclear must apply judgment to decide whether the unclear answer is enough to push the domain to high risk, whether it is methodologically minor enough to keep the domain at low risk, or whether reporting is incomplete enough to rate the domain unclear. The tool's developers explicitly support this judgment-based approach over mechanical scoring.
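One way to operationalize this separation in a review's data extraction is to record the signaling answers and the domain judgment as independent fields, with the judgment and its rationale entered by the reviewer rather than derived. The sketch below is our own illustration, not an official QUADAS-2 schema; only the design point, a recorded rather than computed judgment, comes from the tool.

```python
# Hypothetical extraction record for one QUADAS-2 domain. Field names are our
# own invention; note that risk_of_bias is entered by the reviewer, never
# computed from the signaling answers.
from dataclasses import dataclass

@dataclass
class DomainAssessment:
    domain: str                               # e.g. "patient selection"
    signaling_answers: dict[str, str]         # question -> "yes"/"no"/"unclear"
    risk_of_bias: str                         # "low"/"high"/"unclear", reviewer's call
    rationale: str                            # why the reviewer judged it so
    applicability_concern: str | None = None  # first three domains only

d1 = DomainAssessment(
    domain="patient selection",
    signaling_answers={
        "consecutive or random sample enrolled?": "yes",
        "case-control design avoided?": "yes",
        "inappropriate exclusions avoided?": "unclear",
    },
    risk_of_bias="low",   # judgment: the unclear answer was methodologically minor
    rationale="Exclusion criteria incompletely reported but unlikely to bias.",
    applicability_concern="low",
)
```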
This is why QUADAS-2 does not produce a numerical score. Composite scoring was a known problem with QUADAS-1 and is explicitly discouraged in QUADAS-2 documentation. Reviewers should resist the temptation to compute a weighted average of domain ratings: such scores are not validated and undermine the structured judgment the tool is designed to support. The risk of bias assessment guide discusses this design pattern across all current risk-of-bias tools.