The GRADE certainty-of-evidence framework is the most widely adopted method for rating how confident researchers and clinicians should be in a body of evidence from a systematic review. GRADE (Grading of Recommendations, Assessment, Development and Evaluations) provides a structured, transparent approach that separates the certainty of evidence from the strength of recommendations, a distinction that transformed how systematic reviews inform clinical practice and public health policy.
We apply GRADE assessments in every systematic review we deliver, and the most common client question is why observational evidence starts at "low" rather than being rated on its own merits. This guide walks through every domain, every certainty level, and a complete worked example so you can apply the framework confidently. Rate your evidence certainty with our free GRADE assessment tool, structured domain by domain with export.
What Is the GRADE Framework?
GRADE is a systematic, transparent framework for rating the certainty of a body of evidence. It evaluates 5 domains that can downgrade evidence (risk of bias, inconsistency, indirectness, imprecision, publication bias) and 3 that can upgrade observational evidence, producing 4 certainty levels. GRADE assesses certainty for each outcome in a systematic review; the unit of assessment is not the individual study but the entire body of evidence contributing to an effect estimate.
Before GRADE, systematic review authors used inconsistent terminology and subjective judgments to describe the quality of their evidence. Some labeled evidence as "strong" or "weak" without defining what those terms meant. Others applied different criteria depending on the field, making it impossible to compare evidence ratings across reviews or disciplines.
The GRADE Working Group, established in 2000, developed the framework to solve this problem. Their goal was to create a single, transparent system that any researcher could apply consistently and any reader could interpret unambiguously. As of 2026, GRADE has been adopted by over 110 organizations worldwide, including Cochrane, WHO, NICE, and the CDC, for rating the certainty of evidence in systematic reviews and clinical practice guidelines (GRADE Working Group).
It is critical to understand what GRADE is not. GRADE does not assess the methodological quality of individual studies; that is the role of risk of bias tools like RoB 2 or ROBINS-I, covered in our RoB assessment overview. GRADE operates at the outcome level, evaluating the entire body of evidence for a specific outcome across all included studies. A GRADE assessment asks "How confident are we that the true effect lies close to the estimated effect?" rather than "Was this study well-conducted?"
This distinction matters because a single high-quality RCT can provide high-certainty evidence, while ten poorly designed RCTs may provide only low-certainty evidence. The assessment is always about the body of evidence, not any one study.
The 5 Downgrade Domains in GRADE Assessment
Each GRADE assessment begins at a starting certainty level (High for randomized controlled trials, Low for observational studies) and then evaluates 5 domains, each of which can lower the certainty by one or two levels. Understanding these GRADE domains is the core skill required for any researcher conducting a systematic review.
Each domain is assessed as "no serious concern," "serious concern" (downgrade by one level), or "very serious concern" (downgrade by two levels). The total downgrade across all 5 domains, applied to the starting level, determines the final certainty of evidence rating.
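The downgrade arithmetic can be sketched in a few lines of code. This is an illustrative Python sketch, not an official GRADE tool; the function name `certainty_rating` and the 0/1/2 judgment encoding are our own conventions.

```python
# Illustrative sketch of GRADE downgrade arithmetic (not an official tool).
# Judgment codes (our convention): 0 = no serious concern,
# 1 = serious (downgrade one level), 2 = very serious (downgrade two).

LEVELS = ["Very Low", "Low", "Moderate", "High"]

def certainty_rating(study_design, judgments):
    """Combine per-domain downgrade judgments into a final certainty level.

    study_design: "rct" (starts High) or "observational" (starts Low).
    judgments: dict mapping domain name -> downgrade severity (0, 1, or 2).
    """
    start = 3 if study_design == "rct" else 1  # index into LEVELS
    total_downgrade = sum(judgments.values())
    return LEVELS[max(0, start - total_downgrade)]

# RCT evidence with serious risk of bias and serious imprecision:
print(certainty_rating("rct", {"risk_of_bias": 1, "imprecision": 1}))  # prints "Low"
```

The arithmetic is simple; the judgment behind each 0, 1, or 2 is where the real GRADE work happens.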
Domain 1: Risk of Bias (Study Limitations)
The risk of bias domain evaluates whether the included studies have systematic flaws that could bias the effect estimate. For RCTs, this includes inadequate randomization, lack of blinding, incomplete outcome data, selective reporting, and other sources of bias. For observational studies, this includes confounding, selection bias, measurement error, and missing data.
GRADE assesses risk of bias across the body of evidence, not for individual studies. If most studies contributing to an outcome are at low risk of bias and a single small study is at high risk, the overall domain judgment may still be "no serious concern." Conversely, if the largest and most influential study has critical risk of bias, a downgrade is warranted even if smaller studies are sound.
The most common mistake here is double-counting. If you already assessed risk of bias using RoB 2 or the Newcastle-Ottawa Scale, the GRADE risk of bias domain draws on those results; it does not repeat the assessment. It synthesizes individual study ratings into a single judgment about the body of evidence. Use our risk of bias assessment tool to ensure your individual study assessments are correct before applying GRADE.
Domain 2: Inconsistency (Heterogeneity Across Studies)
The inconsistency domain evaluates whether the results of individual studies point in the same direction and have similar magnitude. When study results vary widely, some showing benefit, others showing harm, or effect sizes ranging from trivial to large, our confidence that the true effect lies near the pooled estimate decreases.
GRADE uses several indicators to assess inconsistency: the direction of effects across studies, the overlap of confidence intervals, statistical heterogeneity measured by I-squared, and the results of subgroup analyses exploring the source of variation. An I-squared above 50% raises concern, but the number alone is insufficient; the clinical significance of the variation matters more than the statistical metric.
If heterogeneity can be explained by pre-specified subgroup analyses (for example, studies in children versus adults show different effects, and this difference is clinically plausible), you may decide not to downgrade. Instead, you present subgroup-specific estimates with their own GRADE ratings. Unexplained heterogeneity, however, warrants downgrading.
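For reference, I-squared is derived from Cochran's Q as I² = max(0, (Q − df)/Q) × 100, with df = k − 1 for k studies. A minimal sketch with hypothetical inputs:

```python
# Sketch: Cochran's Q and I-squared from study effects and standard errors.
# Inverse-variance (fixed-effect) weights; all data here are hypothetical.

def i_squared(effects, ses):
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Three equally precise studies, one an outlier:
print(f"{i_squared([0.0, 0.0, 3.0], [1.0, 1.0, 1.0]):.1f}%")  # prints "66.7%"
```

Note that a high I-squared still requires the clinical judgment described above; the statistic alone does not settle the inconsistency domain.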
Domain 3: Indirectness (Generalizability Issues)
The indirectness domain assesses whether the evidence directly answers the review question. Indirectness occurs when the population, intervention, comparator, or outcome in the included studies differs from the question the review is trying to answer.
There are two common forms of indirectness. The first is the surrogate (proxy) outcome: studies measure bone density, for example, as a stand-in for the fracture risk the review actually cares about. The second is the indirect comparison: you want to compare interventions A and B, but no head-to-head trials exist, so you rely on studies comparing A versus placebo and B versus placebo separately.
Indirectness is the domain most frequently overlooked by novice GRADE users. If your review includes studies conducted exclusively in high-income hospital settings but your review question concerns primary care in low-resource settings, the evidence is indirect even if the studies are perfectly designed.
Domain 4: Imprecision (Wide Confidence Intervals, Few Events)
The imprecision domain evaluates whether the effect estimate is precise enough to support a confident conclusion. Wide confidence intervals that cross clinically important thresholds (for example, an interval that includes both meaningful benefit and meaningful harm) indicate imprecision.
GRADE recommends assessing imprecision using optimal information size (OIS): would the total sample across included studies be large enough to detect a clinically important effect in a single adequately powered trial? If the total sample is smaller than the OIS, or if the confidence interval crosses the threshold for clinical decision-making, downgrade for imprecision.
For dichotomous outcomes, a total event count below 300 is a common threshold for concern. For continuous outcomes, GRADE suggests examining whether the confidence interval includes effects that would lead to different clinical decisions. The key question is whether the data provide enough precision to act on.
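For continuous outcomes, the OIS is often approximated with the conventional two-group sample-size formula, n per group = 2(z_α + z_β)² σ² / Δ². The sketch below assumes α = 0.05 (two-sided), 80% power, and a hypothetical standard deviation; it is a back-of-envelope check, not a full power analysis.

```python
import math

# Sketch: OIS approximated by the conventional two-group sample size for a
# continuous outcome. z values assume alpha = 0.05 (two-sided), 80% power.
def ois_per_group(delta, sd, z_alpha=1.96, z_beta=0.8416):
    """Participants per group needed to detect a difference of `delta`."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# Hypothetical: detect a 3-point difference assuming an SD of 7 points.
n = ois_per_group(delta=3, sd=7)
print(n, "per group,", 2 * n, "total")  # prints "86 per group, 172 total"
```

The result is sensitive to the assumed SD, which is why GRADE asks you to state the basis for your OIS rather than just a number.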
Domain 5: Publication Bias (Missing Studies)
The publication bias domain addresses the possibility that studies with null or negative results were never published, skewing the available evidence toward positive findings. A funnel plot, which plots study effect sizes against their precision, can help detect publication bias: asymmetry suggests that small studies with unfavorable results may be missing.
Assessing publication bias is inherently difficult because you are trying to detect what is not there. GRADE recommends considering several factors: funnel plot asymmetry (when at least 10 studies are available), whether the included studies were funded by industry (which has a documented tendency toward selective reporting), whether unpublished data were sought, and whether study registries like ClinicalTrials.gov reveal completed but unreported studies.
In practice, publication bias is the domain where reviewers are most uncertain. When in doubt, GRADE suggests a "suspected" judgment with a downgrade by one level rather than ignoring the concern entirely.
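Funnel-plot asymmetry is often probed with Egger's regression: regress the standardized effect (effect/SE) on precision (1/SE) and examine the intercept. The sketch below is a simplified, unweighted version with hypothetical data; a real analysis uses the weighted test with a formal significance assessment.

```python
# Simplified (unweighted) Egger-style regression: standardized effect
# against precision; an intercept far from zero suggests funnel asymmetry.
# Hypothetical sketch; real analyses use the weighted test and a p-value.

def egger_intercept(effects, ses):
    x = [1 / se for se in ses]                    # precision
    y = [e / se for e, se in zip(effects, ses)]   # standardized effect
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx

# A perfectly symmetric set (constant effect across precisions) gives an
# intercept near zero:
print(egger_intercept([0.5, 0.5, 0.5, 0.5], [0.5, 1.0, 1.5, 2.0]))
```

Remember the caveat above: with fewer than about 10 studies, regression and funnel-plot approaches are underpowered, and judgment dominates.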
The 3 Upgrade Domains for Observational Studies
Observational studies start at Low certainty in the GRADE framework, but three domains can upgrade the certainty by one or two levels. These upgrade domains are applied only after the 5 downgrade domains have been assessed and only when no downgrade has been applied for any domain, or when the evidence is so compelling that upgrading is justified despite downgrades.
Large magnitude of effect. When the effect size is large (relative risk greater than 2 or less than 0.5 in the absence of plausible confounders), GRADE permits upgrading by one level. Very large effects (RR greater than 5 or less than 0.2) can warrant upgrading by two levels (Schunemann et al., 2013).
Dose-response gradient. A clear dose-response relationship between exposure and outcome increases confidence that the association is causal. If higher doses consistently produce larger effects in a graded, predictable pattern, the certainty of evidence can be upgraded by one level.
Plausible confounding would reduce the observed effect. If all plausible residual confounders would bias the result toward the null (or toward a smaller effect), but a significant effect is still observed, this strengthens the causal inference. For example, if unmeasured confounders would be expected to reduce the association between smoking and lung cancer, but the association remains strong, confounding actually makes the true effect more likely to be at least as large as observed.
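The magnitude thresholds above can be encoded as a simple rule. This is an illustrative sketch of the heuristic only; GRADE still requires the judgment that plausible confounding has been ruled out before any upgrade is applied.

```python
# Sketch of the effect-magnitude upgrade heuristic for observational
# evidence. Judgment still required: confounding must first be implausible.

def magnitude_upgrade(rr):
    """Upgrade levels suggested by relative risk magnitude alone."""
    if rr > 5 or rr < 0.2:
        return 2   # very large effect: up to two levels
    if rr > 2 or rr < 0.5:
        return 1   # large effect: one level
    return 0       # no upgrade on magnitude grounds

print(magnitude_upgrade(3.0), magnitude_upgrade(6.0), magnitude_upgrade(1.2))
# prints "1 2 0"
```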
GRADE Framework Certainty of Evidence Levels
The four certainty of evidence levels represent the endpoint of every GRADE assessment. Each level communicates a specific degree of confidence in the effect estimate, and each has direct implications for how the evidence should be used in decision-making.
| Certainty Level | Definition | What It Means for Decision-Making | Starting Point |
|---|---|---|---|
| High certainty | Very confident that the true effect lies close to the estimate | Strong basis for recommendations; unlikely to change with new research | RCTs (no downgrades) |
| Moderate certainty | Moderately confident; true effect is likely close to estimate | Reasonable basis for recommendations; new research may change confidence | RCTs (1 downgrade) or Observational (upgrades) |
| Low certainty | Limited confidence; true effect may be substantially different | Weak basis for recommendations; new research likely to change estimate | Observational (no changes) or RCTs (2 downgrades) |
| Very low certainty | Very little confidence; true effect likely substantially different | Very weak basis; any estimate is highly uncertain | Multiple downgrades from starting point |
Evidence from randomized controlled trials (RCTs) starts at high certainty and can only move downward through the 5 downgrade domains. Evidence from observational studies starts at low certainty and can move downward through downgrades or upward through the 3 upgrade domains.
The final evidence certainty rating is determined by the cumulative effect of all domain assessments. A body of RCT evidence with serious risk of bias and serious imprecision would be downgraded twice, moving from High to Low. Observational evidence with a very large effect size and no downgrades could be upgraded from Low to Moderate or even High, though this is rare in practice.
It is important to note that certainty of evidence is not the same as strength of recommendation. A strong recommendation can be made even with low-certainty evidence when the benefits clearly outweigh the harms, and a weak recommendation may be appropriate even with high-certainty evidence when the balance of benefits and harms is close.
How to Create a Summary of Findings Table
GRADE produces Summary of Findings tables, the standard format for presenting evidence certainty alongside effect estimates for each outcome. A summary of findings table is required by Cochrane for all systematic reviews and recommended by PRISMA 2020 reporting guidelines. It is the single most important output of a GRADE assessment.
A standard SoF table includes the following columns for each outcome:
| Column | Content |
|---|---|
| Outcome | Name of the outcome assessed |
| Number of participants (studies) | Total sample size and number of contributing studies |
| Relative effect (95% CI) | Risk ratio, odds ratio, or hazard ratio with confidence interval |
| Absolute effect (95% CI) | Risk difference or mean difference with confidence interval |
| Certainty | GRADE rating: High, Moderate, Low, or Very Low |
| Comments | Footnotes explaining downgrade/upgrade decisions |
GRADEpro GDT is the standard software for creating SoF tables. It is free for systematic review authors and provides structured templates that enforce consistent formatting. The GRADEpro software guides you through each domain assessment, records your rationale for each judgment, and generates publication-ready tables. Manual creation of SoF tables is possible but error-prone; we strongly recommend using GRADEpro for all reviews.
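For illustration only, a minimal SoF-style Markdown table can be assembled programmatically. The row data below are hypothetical, and this sketch is no substitute for GRADEpro; it simply shows the column structure in code.

```python
# Illustrative: assemble a minimal SoF-style Markdown table. Row data are
# hypothetical; real reviews should use GRADEpro GDT.

def sof_markdown(rows):
    cols = ["Outcome", "Participants (studies)", "Relative effect (95% CI)",
            "Certainty", "Comments"]
    lines = ["| " + " | ".join(cols) + " |", "|" + "---|" * len(cols)]
    for r in rows:
        lines.append("| " + " | ".join(str(r.get(c, "")) for c in cols) + " |")
    return "\n".join(lines)

table = sof_markdown([{
    "Outcome": "Anxiety reduction (HAM-A, 12 wk)",
    "Participants (studies)": "482 (5 RCTs)",
    "Relative effect (95% CI)": "MD -4.2 (-6.1 to -2.3)",
    "Certainty": "Moderate",
    "Comments": "Downgraded once for risk of bias",
}])
print(table)
```

Whatever tooling you use, the certainty footnotes must carry the auditable rationale, not just the rating.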
The SoF table should include the most important outcomes for decision-making (GRADE recommends up to 7), as determined during protocol development. Selecting outcomes after seeing results introduces bias; Cochrane recommends pre-specifying outcomes in the registered protocol. A systematic review follows PRISMA 2020 reporting guidelines (Page et al., 2021), and the SoF table fulfills the PRISMA requirement for evidence certainty assessment.
Risk of bias is GRADE Domain 1; make sure your underlying assessments are correct using our RoB assessment overview.
Common GRADE Assessment Mistakes
Even experienced reviewers make errors when applying the GRADE framework. These mistakes reduce the credibility of the certainty assessment and can lead reviewers and editors to question the validity of the entire systematic review. Avoiding these errors requires understanding what GRADE is designed to do, and what it is not.
Double-counting risk of bias. The most frequent error is assessing risk of bias at the individual study level (using RoB 2 or similar tools) and then re-penalizing the same concerns in the GRADE risk of bias domain. GRADE Domain 1 draws on your individual study assessments; it does not repeat them. If three of five studies have high risk of bias due to lack of blinding, the GRADE judgment considers how this affects the body of evidence for that outcome. Penalizing both the individual studies and the GRADE domain effectively counts the same limitation twice.
Downgrading for every minor concern. GRADE is a judgment framework, not a checklist. Not every minor concern warrants a downgrade. If studies show slight heterogeneity (I-squared of 30%) with overlapping confidence intervals and clinically consistent effect directions, a judgment of "no serious inconsistency" is appropriate. Over-downgrading produces artificially low certainty ratings that do not reflect the actual state of the evidence.
Not documenting rationale for downgrade/upgrade decisions. Every GRADE judgment must include explicit reasoning. Stating "downgraded for risk of bias" without explaining which studies drove the concern, how prevalent the bias was, and how it likely affected the estimate fails to meet GRADE transparency standards. Footnotes in the SoF table should contain specific, auditable rationale.
Confusing certainty of evidence with recommendation strength. Certainty of evidence answers "How confident are we in the effect estimate?" Recommendation strength answers "Should we recommend this intervention?" These are related but distinct. High-certainty evidence that an intervention has a very small benefit does not automatically warrant a strong recommendation. Low-certainty evidence may support a strong recommendation when the stakes are high and no alternatives exist.
Applying GRADE to individual studies rather than a body of evidence. GRADE assesses the certainty of evidence for an outcome across all studies contributing to that outcome in the review. It does not rate individual studies. An individual study receives a risk of bias assessment, the body of evidence for an outcome receives a GRADE certainty rating.
GRADE in Practice: A Worked Example
A practical example demonstrates how the GRADE framework moves from starting certainty to a final rating through domain-by-domain assessment. This worked example follows the process you would apply to any outcome in your own systematic review.
Scenario: You are conducting a systematic review of cognitive behavioral therapy (CBT) versus usual care for reducing anxiety symptoms in adults with generalized anxiety disorder. Five RCTs contribute to the outcome "anxiety symptom reduction measured by the Hamilton Anxiety Rating Scale (HAM-A) at 12 weeks." The pooled mean difference is -4.2 points (95% CI: -6.1 to -2.3), favoring CBT. The minimal clinically important difference is 3 points.
Starting certainty: High (body of evidence consists of RCTs).
Domain 1, Risk of bias: Three of five studies have adequate randomization and allocation concealment. Two studies lack blinding of outcome assessors, and the HAM-A is a clinician-rated scale, making this a meaningful concern. However, the two unblinded studies are the smallest contributors to the pooled estimate. Judgment: Serious concern. Downgrade by one. Certainty moves to Moderate.
Domain 2, Inconsistency: All five studies favor CBT. Effect sizes range from -3.1 to -5.8 points. I-squared is 38%, and all confidence intervals overlap. The variation is clinically unimportant, all studies show effects above the minimal clinically important difference. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Domain 3, Indirectness: All studies enrolled adults with DSM-5-diagnosed generalized anxiety disorder. All used face-to-face CBT delivered by licensed therapists. The population, intervention, comparator, and outcome match the review question directly. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Domain 4, Imprecision: The total sample across five studies is 482 participants, which exceeds the optimal information size of approximately 400 for detecting a 3-point difference. The 95% confidence interval (-6.1 to -2.3) excludes no effect, although its upper end falls slightly short of the 3-point minimal clinically important difference; because the point estimate and most of the interval indicate a clinically meaningful effect and the OIS is met, this is a borderline case. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Domain 5, Publication bias: Funnel plot assessment is limited with only 5 studies. However, all five studies were pre-registered on ClinicalTrials.gov, and two reported null findings for secondary outcomes, suggesting selective reporting is unlikely. No industry funding is involved. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Final rating: Moderate certainty. We are moderately confident that CBT reduces anxiety symptoms by approximately 4.2 points on the HAM-A compared with usual care. The true effect is likely close to this estimate, but lack of blinding in some studies means the effect could be somewhat smaller.
Documented rationale: "Downgraded by one level for risk of bias due to inadequate blinding of outcome assessors in 2 of 5 RCTs. Although these studies contributed less weight to the pooled estimate, the HAM-A is a clinician-rated scale susceptible to detection bias. No serious concerns for inconsistency, indirectness, imprecision, or publication bias."
This is the level of specificity every GRADE assessment requires. For guidance on conducting the underlying systematic review, see our complete systematic review guide, and for the meta-analysis that produces the pooled estimate assessed by GRADE, see our complete meta-analysis guide.
When to Use GRADE vs. GRADE-CERQual
GRADE was designed for quantitative evidence: effect estimates from RCTs and observational studies with numerical outcomes. When your systematic review synthesizes qualitative evidence (interview transcripts, focus group data, ethnographic observations), GRADE-CERQual (Confidence in the Evidence from Reviews of Qualitative Research) is the appropriate framework.
GRADE-CERQual assesses confidence in qualitative review findings across four domains: methodological limitations, coherence, adequacy of data, and relevance. It produces ratings of High, Moderate, Low, or Very Low confidence, parallel to GRADE but with domain-specific criteria suited to qualitative evidence.
| Feature | GRADE | GRADE-CERQual |
|---|---|---|
| Evidence type | Quantitative (RCTs, observational) | Qualitative (interviews, focus groups) |
| Unit of assessment | Body of evidence for an outcome | Review finding from qualitative synthesis |
| Domains | 5 downgrade + 3 upgrade | 4 assessment domains |
| Output | Certainty of evidence rating | Confidence in review finding |
| Presentation | Summary of Findings table | Summary of Qualitative Findings table |
| Tool | GRADEpro GDT | GRADEpro GDT (CERQual module) |
For mixed-methods systematic reviews that combine quantitative and qualitative evidence, you apply GRADE to the quantitative findings and GRADE-CERQual to the qualitative findings. The two assessments are then integrated in the discussion, not combined into a single rating.
Research Gold includes GRADE assessment and Summary of Findings tables in every systematic review. Learn about our complete SR process.