The GRADE certainty-of-evidence framework is the most widely adopted method for rating how confident researchers and clinicians should be in a body of evidence from a systematic review. GRADE (Grading of Recommendations, Assessment, Development and Evaluations) provides a structured, transparent approach that separates the certainty of evidence from the strength of recommendations, a distinction that transformed how systematic reviews inform clinical practice and public health policy.
We apply GRADE assessments in every systematic review we deliver, and the most common client question is why observational evidence starts at "low" rather than being rated on its own merits. This guide walks through every domain, every certainty level, and a complete worked example so you can apply the framework confidently. Rate your evidence certainty with our free GRADE assessment tool, structured domain by domain with export.
What Is the GRADE Framework?
GRADE is a systematic, transparent framework for rating the certainty of a body of evidence. It evaluates 5 domains that can downgrade evidence (risk of bias, inconsistency, indirectness, imprecision, publication bias) and 3 that can upgrade observational evidence, producing 4 certainty levels. GRADE assesses certainty for each outcome in a systematic review; the unit of assessment is not the individual study but the entire body of evidence contributing to an effect estimate.
Before GRADE, systematic review authors used inconsistent terminology and subjective judgments to describe the quality of their evidence. Some labeled evidence as "strong" or "weak" without defining what those terms meant. Others applied different criteria depending on the field, making it impossible to compare evidence ratings across reviews or disciplines.
The GRADE Working Group, established in 2000, developed the framework to solve this problem. Their goal was to create a single, transparent system that any researcher could apply consistently and any reader could interpret unambiguously. As of 2026, GRADE has been adopted by over 110 organizations worldwide, including Cochrane, WHO, NICE, and the CDC, for rating the certainty of evidence in systematic reviews and clinical practice guidelines (GRADE Working Group).
It is critical to understand what GRADE is not. GRADE does not assess the methodological quality of individual studies; that is the role of risk of bias tools like RoB 2 or ROBINS-I, covered in our RoB assessment overview. GRADE operates at the outcome level, evaluating the entire body of evidence for a specific outcome across all included studies. A GRADE assessment asks "How confident are we that the true effect lies close to the estimated effect?" rather than "Was this study well-conducted?"
This distinction matters because a single high-quality RCT can provide high-certainty evidence, while ten poorly designed RCTs may provide only low-certainty evidence. The assessment is always about the body of evidence, not any one study.
The 5 Downgrade Domains in GRADE Assessment
Each GRADE assessment begins at a starting certainty level (High for randomized controlled trials, Low for observational studies) and then evaluates 5 domains, each of which can lower the certainty by one or two levels. Understanding these GRADE domains is the core skill required for any researcher conducting a systematic review.
Each domain is assessed as "no serious concern," "serious concern" (downgrade by one level), or "very serious concern" (downgrade by two levels). The total downgrade across all 5 domains, applied to the starting level, determines the final certainty of evidence rating.
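The downgrade arithmetic can be sketched in a few lines of code. This is an illustrative Python sketch, not an official GRADE tool; the function name `certainty_rating` and the 0/1/2 judgment encoding are our own conventions.

```python
# Illustrative sketch of GRADE downgrade arithmetic (not an official tool).
# Judgment codes (our convention): 0 = no serious concern,
# 1 = serious (downgrade one level), 2 = very serious (downgrade two).

LEVELS = ["Very Low", "Low", "Moderate", "High"]

def certainty_rating(study_design, judgments):
    """Combine per-domain downgrade judgments into a final certainty level.

    study_design: "rct" (starts High) or "observational" (starts Low).
    judgments: dict mapping domain name -> downgrade severity (0, 1, or 2).
    """
    start = 3 if study_design == "rct" else 1  # index into LEVELS
    total_downgrade = sum(judgments.values())
    return LEVELS[max(0, start - total_downgrade)]

# RCT evidence with serious risk of bias and serious imprecision:
print(certainty_rating("rct", {"risk_of_bias": 1, "imprecision": 1}))  # prints "Low"
```

The arithmetic is simple; the judgment behind each 0, 1, or 2 is where the real GRADE work happens.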
Domain 1: Risk of Bias (Study Limitations)
The risk of bias domain evaluates whether the included studies have systematic flaws that could bias the effect estimate. For RCTs, this includes inadequate randomization, lack of blinding, incomplete outcome data, selective reporting, and other sources of bias. For observational studies, this includes confounding, selection bias, measurement error, and missing data.
GRADE assesses risk of bias across the body of evidence, not for individual studies. If most studies contributing to an outcome are at low risk of bias and a single small study is at high risk, the overall domain judgment may still be "no serious concern." Conversely, if the largest and most influential study has critical risk of bias, a downgrade is warranted even if smaller studies are sound.
The most common mistake here is double-counting. If you already assessed risk of bias using RoB 2 or the Newcastle-Ottawa Scale, the GRADE risk of bias domain draws on those results; it does not repeat the assessment. It synthesizes individual study ratings into a single judgment about the body of evidence. Use our risk of bias assessment tool to ensure your individual study assessments are correct before applying GRADE.
Domain 2: Inconsistency (Heterogeneity Across Studies)
The inconsistency domain evaluates whether the results of individual studies point in the same direction and have similar magnitude. When study results vary widely, some showing benefit, others showing harm, or effect sizes ranging from trivial to large, our confidence that the true effect lies near the pooled estimate decreases.
GRADE uses several indicators to assess inconsistency: the direction of effects across studies, the overlap of confidence intervals, statistical heterogeneity measured by I-squared, and the results of subgroup analyses exploring the source of variation. An I-squared above 50% raises concern, but the number alone is insufficient; the clinical significance of the variation matters more than the statistical metric.
If heterogeneity can be explained by pre-specified subgroup analyses (for example, studies in children versus adults show different effects, and this difference is clinically plausible), you may decide not to downgrade. Instead, you present subgroup-specific estimates with their own GRADE ratings. Unexplained heterogeneity, however, warrants downgrading.
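For reference, I-squared is derived from Cochran's Q as I² = max(0, (Q − df)/Q) × 100, with df = k − 1 for k studies. A minimal sketch with hypothetical inputs:

```python
# Sketch: Cochran's Q and I-squared from study effects and standard errors.
# Inverse-variance (fixed-effect) weights; all data here are hypothetical.

def i_squared(effects, ses):
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Three equally precise studies, one an outlier:
print(f"{i_squared([0.0, 0.0, 3.0], [1.0, 1.0, 1.0]):.1f}%")  # prints "66.7%"
```

Note that a high I-squared still requires the clinical judgment described above; the statistic alone does not settle the inconsistency domain.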
Domain 3: Indirectness (Generalizability Issues)
The indirectness domain assesses whether the evidence directly answers the review question. Indirectness occurs when the population, intervention, comparator, or outcome in the included studies differs from the question the review is trying to answer.
There are two common forms of indirectness. The first is the surrogate (proxy) outcome: studies measure bone density, for example, as a stand-in for the fracture risk the review actually cares about. The second is the indirect comparison: you want to compare interventions A and B, but no head-to-head trials exist, so you rely on studies comparing A versus placebo and B versus placebo separately.
Indirectness is the domain most frequently overlooked by novice GRADE users. If your review includes studies conducted exclusively in high-income hospital settings but your review question concerns primary care in low-resource settings, the evidence is indirect even if the studies are perfectly designed.
Domain 4: Imprecision (Wide Confidence Intervals, Few Events)
The imprecision domain evaluates whether the effect estimate is precise enough to support a confident conclusion. Wide confidence intervals that cross clinically important thresholds (for example, an interval that includes both meaningful benefit and meaningful harm) indicate imprecision.
GRADE recommends assessing imprecision using optimal information size (OIS): would the total sample across included studies be large enough to detect a clinically important effect in a single adequately powered trial? If the total sample is smaller than the OIS, or if the confidence interval crosses the threshold for clinical decision-making, downgrade for imprecision.
For dichotomous outcomes, a total event count below 300 is a common threshold for concern. For continuous outcomes, GRADE suggests examining whether the confidence interval includes effects that would lead to different clinical decisions. The key question is whether the data provide enough precision to act on.
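For continuous outcomes, the OIS is often approximated with the conventional two-group sample-size formula, n per group = 2(z_α + z_β)² σ² / Δ². The sketch below assumes α = 0.05 (two-sided), 80% power, and a hypothetical standard deviation; it is a back-of-envelope check, not a full power analysis.

```python
import math

# Sketch: OIS approximated by the conventional two-group sample size for a
# continuous outcome. z values assume alpha = 0.05 (two-sided), 80% power.
def ois_per_group(delta, sd, z_alpha=1.96, z_beta=0.8416):
    """Participants per group needed to detect a difference of `delta`."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# Hypothetical: detect a 3-point difference assuming an SD of 7 points.
n = ois_per_group(delta=3, sd=7)
print(n, "per group,", 2 * n, "total")  # prints "86 per group, 172 total"
```

The result is sensitive to the assumed SD, which is why GRADE asks you to state the basis for your OIS rather than just a number.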
Domain 5: Publication Bias (Missing Studies)
The publication bias domain addresses the possibility that studies with null or negative results were never published, skewing the available evidence toward positive findings. A funnel plot, which plots study effect sizes against their precision, can help detect publication bias: asymmetry suggests that small studies with unfavorable results may be missing.
Assessing publication bias is inherently difficult because you are trying to detect what is not there. GRADE recommends considering several factors: funnel plot asymmetry (when at least 10 studies are available), whether the included studies were funded by industry (which has a documented tendency toward selective reporting), whether unpublished data were sought, and whether study registries like ClinicalTrials.gov reveal completed but unreported studies.
In practice, publication bias is the domain where reviewers are most uncertain. When in doubt, GRADE suggests a "suspected" judgment with a downgrade by one level rather than ignoring the concern entirely.
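Funnel-plot asymmetry is often probed with Egger's regression: regress the standardized effect (effect/SE) on precision (1/SE) and examine the intercept. The sketch below is a simplified, unweighted version with hypothetical data; a real analysis uses the weighted test with a formal significance assessment.

```python
# Simplified (unweighted) Egger-style regression: standardized effect
# against precision; an intercept far from zero suggests funnel asymmetry.
# Hypothetical sketch; real analyses use the weighted test and a p-value.

def egger_intercept(effects, ses):
    x = [1 / se for se in ses]                    # precision
    y = [e / se for e, se in zip(effects, ses)]   # standardized effect
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx

# A perfectly symmetric set (constant effect across precisions) gives an
# intercept near zero:
print(egger_intercept([0.5, 0.5, 0.5, 0.5], [0.5, 1.0, 1.5, 2.0]))
```

Remember the caveat above: with fewer than about 10 studies, regression and funnel-plot approaches are underpowered, and judgment dominates.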
The 3 Upgrade Domains for Observational Studies
Observational studies start at Low certainty in the GRADE framework, but three domains can upgrade the certainty by one or two levels. These upgrade domains are applied only after the 5 downgrade domains have been assessed and only when no downgrade has been applied for any domain, or when the evidence is so compelling that upgrading is justified despite downgrades.
Large magnitude of effect. When the effect size is large (relative risk greater than 2 or less than 0.5 in the absence of plausible confounders), GRADE permits upgrading by one level. Very large effects (RR greater than 5 or less than 0.2) can warrant upgrading by two levels (Schunemann et al., 2013).
Dose-response gradient. A clear dose-response relationship between exposure and outcome increases confidence that the association is causal. If higher doses consistently produce larger effects in a graded, predictable pattern, the certainty of evidence can be upgraded by one level.
Plausible confounding would reduce the observed effect. If all plausible residual confounders would bias the result toward the null (or toward a smaller effect), but a significant effect is still observed, this strengthens the causal inference. For example, if unmeasured confounders would be expected to reduce the association between smoking and lung cancer, but the association remains strong, confounding actually makes the true effect more likely to be at least as large as observed.
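The magnitude thresholds above can be encoded as a simple rule. This is an illustrative sketch of the heuristic only; GRADE still requires the judgment that plausible confounding has been ruled out before any upgrade is applied.

```python
# Sketch of the effect-magnitude upgrade heuristic for observational
# evidence. Judgment still required: confounding must first be implausible.

def magnitude_upgrade(rr):
    """Upgrade levels suggested by relative risk magnitude alone."""
    if rr > 5 or rr < 0.2:
        return 2   # very large effect: up to two levels
    if rr > 2 or rr < 0.5:
        return 1   # large effect: one level
    return 0       # no upgrade on magnitude grounds

print(magnitude_upgrade(3.0), magnitude_upgrade(6.0), magnitude_upgrade(1.2))
# prints "1 2 0"
```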
GRADE Framework Certainty of Evidence Levels
The four certainty of evidence levels represent the endpoint of every GRADE assessment. Each level communicates a specific degree of confidence in the effect estimate, and each has direct implications for how the evidence should be used in decision-making.
| Certainty Level | Definition | What It Means for Decision-Making | Starting Point |
|---|---|---|---|
| High certainty | Very confident that the true effect lies close to the estimate | Strong basis for recommendations; unlikely to change with new research | RCTs (no downgrades) |
| Moderate certainty | Moderately confident; true effect is likely close to estimate | Reasonable basis for recommendations; new research may change confidence | RCTs (1 downgrade) or Observational (upgrades) |
| Low certainty | Limited confidence; true effect may be substantially different | Weak basis for recommendations; new research likely to change estimate | Observational (no changes) or RCTs (2 downgrades) |
| Very low certainty | Very little confidence; true effect likely substantially different | Very weak basis; any estimate is highly uncertain | Multiple downgrades from starting point |
Evidence from randomized controlled trials (RCTs) starts at high certainty and can only move downward through the 5 downgrade domains. Evidence from observational studies starts at low certainty and can move downward through downgrades or upward through the 3 upgrade domains.
The final evidence certainty rating is determined by the cumulative effect of all domain assessments. A body of RCT evidence with serious risk of bias and serious imprecision would be downgraded twice, moving from High to Low. Observational evidence with a very large effect size and no downgrades could be upgraded from Low to Moderate or even High, though this is rare in practice.
It is important to note that certainty of evidence is not the same as strength of recommendation. A strong recommendation can be made even with low-certainty evidence when the benefits clearly outweigh the harms, and a weak recommendation may be appropriate even with high-certainty evidence when the balance of benefits and harms is close.
How to Create a Summary of Findings Table
GRADE produces Summary of Findings tables, the standard format for presenting evidence certainty alongside effect estimates for each outcome. A summary of findings table is required by Cochrane for all systematic reviews and recommended by PRISMA 2020 reporting guidelines. It is the single most important output of a GRADE assessment.
A standard SoF table includes the following columns for each outcome:
| Column | Content |
|---|---|
| Outcome | Name of the outcome assessed |
| Number of participants (studies) | Total sample size and number of contributing studies |
| Relative effect (95% CI) | Risk ratio, odds ratio, or hazard ratio with confidence interval |
| Absolute effect (95% CI) | Risk difference or mean difference with confidence interval |
| Certainty | GRADE rating: High, Moderate, Low, or Very Low |
| Comments | Footnotes explaining downgrade/upgrade decisions |
GRADEpro GDT is the standard software for creating SoF tables. It is free for systematic review authors and provides structured templates that enforce consistent formatting. The GRADEpro software guides you through each domain assessment, records your rationale for each judgment, and generates publication-ready tables. Manual creation of SoF tables is possible but error-prone; we strongly recommend using GRADEpro for all reviews.
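For illustration only, a minimal SoF-style Markdown table can be assembled programmatically. The row data below are hypothetical, and this sketch is no substitute for GRADEpro; it simply shows the column structure in code.

```python
# Illustrative: assemble a minimal SoF-style Markdown table. Row data are
# hypothetical; real reviews should use GRADEpro GDT.

def sof_markdown(rows):
    cols = ["Outcome", "Participants (studies)", "Relative effect (95% CI)",
            "Certainty", "Comments"]
    lines = ["| " + " | ".join(cols) + " |", "|" + "---|" * len(cols)]
    for r in rows:
        lines.append("| " + " | ".join(str(r.get(c, "")) for c in cols) + " |")
    return "\n".join(lines)

table = sof_markdown([{
    "Outcome": "Anxiety reduction (HAM-A, 12 wk)",
    "Participants (studies)": "482 (5 RCTs)",
    "Relative effect (95% CI)": "MD -4.2 (-6.1 to -2.3)",
    "Certainty": "Moderate",
    "Comments": "Downgraded once for risk of bias",
}])
print(table)
```

Whatever tooling you use, the certainty footnotes must carry the auditable rationale, not just the rating.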
The SoF table should include the most important outcomes for decision-making (GRADE recommends up to 7), as determined during protocol development. Selecting outcomes after seeing results introduces bias; Cochrane recommends pre-specifying outcomes in the registered protocol. A systematic review follows PRISMA 2020 reporting guidelines (Page et al., 2021), and the SoF table fulfills the PRISMA requirement for evidence certainty assessment.
Risk of bias is GRADE Domain 1; make sure your underlying assessments are correct using our RoB assessment overview.
Common GRADE Assessment Mistakes
Even experienced reviewers make errors when applying the GRADE framework. These mistakes reduce the credibility of the certainty assessment and can lead reviewers and editors to question the validity of the entire systematic review. Avoiding these errors requires understanding what GRADE is designed to do, and what it is not.
Double-counting risk of bias. The most frequent error is assessing risk of bias at the individual study level (using RoB 2 or similar tools) and then re-penalizing the same concerns in the GRADE risk of bias domain. GRADE Domain 1 draws on your individual study assessments; it does not repeat them. If three of five studies have high risk of bias due to lack of blinding, the GRADE judgment considers how this affects the body of evidence for that outcome. Penalizing both the individual studies and the GRADE domain effectively counts the same limitation twice.
Downgrading for every minor concern. GRADE is a judgment framework, not a checklist. Not every minor concern warrants a downgrade. If studies show slight heterogeneity (I-squared of 30%) with overlapping confidence intervals and clinically consistent effect directions, a judgment of "no serious inconsistency" is appropriate. Over-downgrading produces artificially low certainty ratings that do not reflect the actual state of the evidence.
Not documenting rationale for downgrade/upgrade decisions. Every GRADE judgment must include explicit reasoning. Stating "downgraded for risk of bias" without explaining which studies drove the concern, how prevalent the bias was, and how it likely affected the estimate fails to meet GRADE transparency standards. Footnotes in the SoF table should contain specific, auditable rationale.
Confusing certainty of evidence with recommendation strength. Certainty of evidence answers "How confident are we in the effect estimate?" Recommendation strength answers "Should we recommend this intervention?" These are related but distinct. High-certainty evidence that an intervention has a very small benefit does not automatically warrant a strong recommendation. Low-certainty evidence may support a strong recommendation when the stakes are high and no alternatives exist.
Applying GRADE to individual studies rather than a body of evidence. GRADE assesses the certainty of evidence for an outcome across all studies contributing to that outcome in the review. It does not rate individual studies. An individual study receives a risk of bias assessment, the body of evidence for an outcome receives a GRADE certainty rating.
GRADE in Practice: A Worked Example
A practical example demonstrates how the GRADE framework moves from starting certainty to a final rating through domain-by-domain assessment. This worked example follows the process you would apply to any outcome in your own systematic review.
Scenario: You are conducting a systematic review of cognitive behavioral therapy (CBT) versus usual care for reducing anxiety symptoms in adults with generalized anxiety disorder. Five RCTs contribute to the outcome "anxiety symptom reduction measured by the Hamilton Anxiety Rating Scale (HAM-A) at 12 weeks." The pooled mean difference is -4.2 points (95% CI: -6.1 to -2.3), favoring CBT. The minimal clinically important difference is 3 points.
Starting certainty: High (body of evidence consists of RCTs).
Domain 1, Risk of bias: Three of five studies have adequate randomization and allocation concealment. Two studies lack blinding of outcome assessors, and the HAM-A is a clinician-rated scale, making this a meaningful concern. However, the two unblinded studies are the smallest contributors to the pooled estimate. Judgment: Serious concern. Downgrade by one. Certainty moves to Moderate.
Domain 2, Inconsistency: All five studies favor CBT. Effect sizes range from -3.1 to -5.8 points. I-squared is 38%, and all confidence intervals overlap. The variation is clinically unimportant, all studies show effects above the minimal clinically important difference. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Domain 3, Indirectness: All studies enrolled adults with DSM-5-diagnosed generalized anxiety disorder. All used face-to-face CBT delivered by licensed therapists. The population, intervention, comparator, and outcome match the review question directly. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Domain 4, Imprecision: The total sample across five studies is 482 participants, which exceeds the optimal information size of approximately 400 for detecting a 3-point difference. The 95% confidence interval (-6.1 to -2.3) excludes no effect, although its upper end falls slightly short of the 3-point minimal clinically important difference; because the point estimate and most of the interval indicate a clinically meaningful effect and the OIS is met, this is a borderline case. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Domain 5, Publication bias: Funnel plot assessment is limited with only 5 studies. However, all five studies were pre-registered on ClinicalTrials.gov, and two reported null findings for secondary outcomes, suggesting selective reporting is unlikely. No industry funding is involved. Judgment: No serious concern. No downgrade. Certainty remains Moderate.
Final rating: Moderate certainty. We are moderately confident that CBT reduces anxiety symptoms by approximately 4.2 points on the HAM-A compared with usual care. The true effect is likely close to this estimate, but lack of blinding in some studies means the effect could be somewhat smaller.
Documented rationale: "Downgraded by one level for risk of bias due to inadequate blinding of outcome assessors in 2 of 5 RCTs. Although these studies contributed less weight to the pooled estimate, the HAM-A is a clinician-rated scale susceptible to detection bias. No serious concerns for inconsistency, indirectness, imprecision, or publication bias."
This is the level of specificity every GRADE assessment requires. For guidance on conducting the underlying systematic review, see our complete systematic review guide, and for the meta-analysis that produces the pooled estimate assessed by GRADE, see our complete meta-analysis guide.
When to Use GRADE vs. GRADE-CERQual
GRADE was designed for quantitative evidence: effect estimates from RCTs and observational studies with numerical outcomes. When your systematic review synthesizes qualitative evidence (interview transcripts, focus group data, ethnographic observations), GRADE-CERQual (Confidence in the Evidence from Reviews of Qualitative Research) is the appropriate framework.
GRADE-CERQual assesses confidence in qualitative review findings across four domains: methodological limitations, coherence, adequacy of data, and relevance. It produces ratings of High, Moderate, Low, or Very Low confidence, parallel to GRADE but with domain-specific criteria suited to qualitative evidence.
| Feature | GRADE | GRADE-CERQual |
|---|---|---|
| Evidence type | Quantitative (RCTs, observational) | Qualitative (interviews, focus groups) |
| Unit of assessment | Body of evidence for an outcome | Review finding from qualitative synthesis |
| Domains | 5 downgrade + 3 upgrade | 4 assessment domains |
| Output | Certainty of evidence rating | Confidence in review finding |
| Presentation | Summary of Findings table | Summary of Qualitative Findings table |
| Tool | GRADEpro GDT | GRADEpro GDT (CERQual module) |
For mixed-methods systematic reviews that combine quantitative and qualitative evidence, you apply GRADE to the quantitative findings and GRADE-CERQual to the qualitative findings. The two assessments are then integrated in the discussion, not combined into a single rating.
Research Gold includes GRADE assessment and Summary of Findings tables in every systematic review. Learn about our complete SR process.