The Newcastle-Ottawa Scale is a quality assessment tool for evaluating non-randomized studies in systematic reviews, developed by Wells et al. at the Ottawa Hospital Research Institute. It uses a star-based scoring system across three domains: Selection, Comparability, and Outcome (for cohort studies) or Exposure (for case-control studies). A maximum of 9 stars is awarded, classifying studies as high, moderate, or low quality.

If your systematic review includes observational studies, you need a validated method for assessing their methodological quality. The Newcastle-Ottawa Scale guide you are reading covers every aspect of NOS assessment: the three scoring domains, item-by-item evaluation criteria, differences between cohort and case-control versions, quality thresholds, and how NOS compares to ROBINS-I. Whether you are conducting your first quality assessment or refining your approach, this guide provides the practical knowledge you need to apply NOS correctly and consistently.

What Is the Newcastle-Ottawa Scale?

The Newcastle-Ottawa Scale was developed by Wells et al. at the Ottawa Hospital Research Institute and the University of Newcastle, Australia, as a tool for assessing the quality of non-randomized studies included in systematic reviews. Introduced in the early 2000s and widely adopted since, NOS has become one of the most cited quality assessment instruments in evidence synthesis.

NOS uses a star-based scoring system that awards between 0 and 9 stars across three broad domains. Each star represents adequate fulfillment of a specific methodological criterion. The simplicity of this approach, awarding or withholding a star for each item, makes NOS faster to complete than more complex tools while still capturing the core dimensions of study quality.

The scale operates on a principle of methodological adequacy rather than perfection. A study does not need flawless methodology to receive a star; it needs to meet a minimum threshold of methodological soundness for each criterion. This pragmatic approach reflects the reality that observational studies inherently carry more bias risk than randomized controlled trials, and quality assessment should differentiate between studies that took reasonable precautions and those that did not.

NOS is endorsed by the Cochrane Collaboration as an acceptable tool for assessing non-randomized studies (Higgins et al., 2023, Cochrane Handbook Chapter 25). Its widespread use means that reviewers, editors, and readers are familiar with NOS scores, making your quality assessment results immediately interpretable to your audience.

The 3 NOS Domains and Scoring

The NOS scoring system distributes its maximum 9 stars across three domains. Understanding what each domain evaluates, and how stars are awarded, is essential for consistent, defensible quality assessment.

Selection (4 Stars)

The Selection domain evaluates whether the study identified and enrolled participants in a way that minimizes selection bias. Four stars are available, each addressing a different aspect of how participants were selected and defined.

For cohort studies, the four Selection items assess: representativeness of the exposed cohort, selection of the non-exposed cohort, ascertainment of exposure, and demonstration that the outcome of interest was not present at the start of the study. For case-control studies, the items assess: adequacy of case definition, representativeness of cases, selection of controls, and definition of controls.

A study earns one star per criterion when it demonstrates adequate methodology. For example, a cohort study earns a Selection star for exposure ascertainment if exposure was measured using a validated instrument or secure medical record rather than self-report alone.

Comparability (2 Stars)

The Comparability domain is the most subjective component of NOS and the domain most frequently misapplied. It evaluates whether the study controlled for confounding variables, awarding up to 2 stars based on the adjustment strategy.

One star is awarded if the study controls for the single most important confounder. A second star is awarded if the study also controls for any additional important confounder. The critical requirement here is that you, as the reviewer, must pre-specify which confounders qualify before beginning your assessment. For example, in a study examining smoking and lung cancer, age might be designated the most important confounder and sex the second.

This domain requires you to state in your systematic review protocol which confounders earn the first and second star. Without pre-specification, you risk post-hoc rationalization (deciding after seeing results which confounders matter), which undermines the objectivity of your assessment.

Outcome/Exposure (3 Stars)

The third domain evaluates the quality of outcome measurement (in cohort studies) or exposure measurement (in case-control studies), awarding up to 3 stars.

For cohort studies, the three Outcome items assess: method of outcome assessment, length of follow-up, and adequacy of follow-up (attrition). For case-control studies, the three Exposure items assess: ascertainment of exposure, same method of ascertainment for cases and controls, and non-response rate.

A cohort study earns a star for outcome assessment when outcomes are verified through independent blind assessment or record linkage rather than self-report. It earns a follow-up star when the duration is sufficient for outcomes to occur, and an adequacy star when the proportion lost to follow-up is acceptable (commonly less than 20%).

The following table summarizes the items evaluated in each domain for both study types:

| Domain | Cohort Study Items | Case-Control Study Items | Max Stars |
|---|---|---|---|
| Selection | Representativeness of exposed cohort, Selection of non-exposed cohort, Ascertainment of exposure, Outcome not present at start | Adequacy of case definition, Representativeness of cases, Selection of controls, Definition of controls | 4 |
| Comparability | Controls for most important confounder, Controls for additional confounder | Controls for most important confounder, Controls for additional confounder | 2 |
| Outcome/Exposure | Assessment of outcome, Length of follow-up, Adequacy of follow-up | Ascertainment of exposure, Same method for cases and controls, Non-response rate | 3 |

NOS for Cohort Studies vs Case-Control Studies

The Newcastle-Ottawa Scale exists in two primary versions: one for cohort studies and one for case-control studies. While the three-domain structure and maximum 9 stars remain identical, the specific items within each domain differ to reflect the distinct methodological concerns of each study design.

| Feature | NOS Cohort Version | NOS Case-Control Version |
|---|---|---|
| Selection focus | Exposed and non-exposed cohort identification | Case and control identification |
| Third domain | Outcome assessment | Exposure assessment |
| Follow-up items | Length and adequacy of follow-up | Non-response rate |
| Temporal direction | Prospective/retrospective follow-up | Retrospective exposure assessment |
| Common confounders | Age, baseline disease severity | Age, sex, matching variables |

The cohort version emphasizes longitudinal follow-up: whether participants were tracked long enough for outcomes to develop and whether attrition was acceptable. The case-control version focuses on whether exposure was ascertained identically in cases and controls, because differential exposure measurement is the primary source of information bias in case-control designs.

Cross-sectional study adaptation is a common need that the original NOS does not officially address. Several research groups have published modified NOS versions for cross-sectional studies, most notably the adaptation by Herzog et al. (2013). These modified scales retain the three-domain structure but replace follow-up items with items relevant to cross-sectional designs, such as sample size justification and statistical adjustment. While widely used, these adaptations are not officially validated by Wells et al. and should be cited separately from the original NOS.

When your systematic review includes both cohort and case-control studies, apply the appropriate NOS version to each study type. Report NOS scores separately by study design in your results, as a 7-star cohort study and a 7-star case-control study have met different criteria and are not directly comparable on individual domain items.

How to Score and Interpret the Newcastle-Ottawa Scale

Applying the NOS quality assessment consistently requires a structured approach. Inconsistent scoring between reviewers is the most common criticism of NOS (Stang, 2010), and following a systematic process minimizes this problem.

Step 1: Pre-specify your criteria. Before assessing any study, document in your protocol which confounders qualify for Comparability stars, what follow-up duration is adequate, and what attrition threshold is acceptable. These decisions should be based on your review question and the clinical context, not on what the included studies happen to report.

Step 2: Assess each item independently. Work through each NOS item one at a time. For each item, determine whether the study meets the criterion for a star based on what the authors reported. If information is missing or unclear, the study does not earn the star; do not assume adequate methodology when it is not documented.

Step 3: Use two independent assessors. Two reviewers should assess each study independently, then compare scores. Calculate inter-rater agreement using Cohen's kappa or percentage agreement. Resolve discrepancies through discussion or a third reviewer.
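As a concrete illustration, Cohen's kappa for two reviewers' star/no-star decisions can be computed with a short script. This is a minimal sketch: the function name and the reviewer data are hypothetical, and in practice you would pool the item-level decisions across all studies in your review.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgements (e.g. star/no star)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical star (1) / no-star (0) decisions across ten NOS items
reviewer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
reviewer_2 = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

Kappa corrects the raw percentage agreement for the agreement expected by chance, which is why it is preferred when star-award rates are imbalanced.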

Step 4: Apply quality thresholds. The most commonly used quality thresholds for NOS are:

| Total NOS Stars | Quality Classification |
|---|---|
| 7-9 stars | High quality |
| 4-6 stars | Moderate quality |
| 0-3 stars | Low quality |

These thresholds should be pre-specified in your protocol and applied consistently. Some reviews use alternative cut-points (such as 6+ for high quality), which is acceptable as long as the threshold is justified and declared prospectively.
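The threshold mapping can be expressed as a small helper, which also makes the cut-points easy to change if your protocol pre-specifies different ones. A sketch in Python; the function name is illustrative and the 7/4 cut-points are the common defaults described above.

```python
def classify_nos(total_stars: int) -> str:
    """Map a total NOS score (0-9 stars) to the common three-tier classification."""
    if not 0 <= total_stars <= 9:
        raise ValueError("NOS totals range from 0 to 9 stars")
    if total_stars >= 7:
        return "high"
    if total_stars >= 4:
        return "moderate"
    return "low"

print(classify_nos(7), classify_nos(5), classify_nos(2))  # high moderate low
```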

Step 5: Record domain-level scores. Report both the total score and the breakdown by domain. A study scoring 7 total stars with 4-2-1 (strong Selection, strong Comparability, weak Outcome) has a fundamentally different quality profile than a study scoring 7 with 2-2-3 (weak Selection, strong Comparability, strong Outcome). Domain-level transparency allows readers to evaluate where quality concerns concentrate.
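The 4-2-1 versus 2-2-3 contrast is easy to make visible in your extraction data by storing domain scores rather than only totals. A minimal sketch with hypothetical study names:

```python
# Hypothetical domain breakdowns: both studies total 7 stars,
# but their quality profiles differ sharply
nos_scores = {
    "Study A": {"selection": 4, "comparability": 2, "outcome": 1},
    "Study B": {"selection": 2, "comparability": 2, "outcome": 3},
}
for study, d in nos_scores.items():
    profile = f"{d['selection']}-{d['comparability']}-{d['outcome']}"
    print(f"{study}: total {sum(d.values())}, profile {profile}")
```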

Pre-specifying your NOS criteria in the protocol is not optional; it is a methodological safeguard that prevents bias in the quality assessment itself. Cochrane Handbook Chapter 25 (Higgins et al., 2023) emphasizes that all quality assessment criteria should be established before data extraction begins.

Newcastle-Ottawa Scale vs ROBINS-I

Choosing between NOS and ROBINS-I is one of the most common decisions systematic reviewers face when planning quality assessment for non-randomized studies. Both tools assess observational study quality, but they differ fundamentally in approach, complexity, and output.

| Feature | Newcastle-Ottawa Scale | ROBINS-I |
|---|---|---|
| Scoring | Star-based (0-9 stars) | Domain judgements (Low/Moderate/Serious/Critical) |
| Domains | 3 (Selection, Comparability, Outcome/Exposure) | 7 bias domains |
| Time per study | 10-15 minutes | 30-60 minutes |
| Output | Numeric score | Overall risk of bias judgement |
| Target trial concept | No | Yes, requires specifying hypothetical RCT |
| Training required | Minimal | Substantial |
| Best for | Quick screening, large reviews | Rigorous assessment, Cochrane reviews |

NOS advantages. Speed is the primary advantage. For reviews including 30 or more observational studies, NOS allows efficient quality assessment without dedicating weeks to the task. The star-based system produces a numeric score that is easy to use in meta-regression or subgroup analyses. NOS is also simpler to learn, requiring less training than ROBINS-I.

ROBINS-I advantages. ROBINS-I provides more granular, structured assessment through its seven bias domains. The target trial framework, where you specify the hypothetical randomized trial that each observational study is attempting to emulate, forces explicit consideration of confounding, selection bias, and measurement bias at each stage. ROBINS-I domain judgements (Low, Moderate, Serious, Critical risk of bias) are more informative than a single numeric score.

When to choose NOS. Use NOS when your review includes many observational studies, when your journal or guideline body accepts NOS, or when time constraints favor a faster tool. NOS remains the most commonly used tool in published systematic reviews of observational studies.

When to choose ROBINS-I. Use ROBINS-I when conducting a Cochrane review that includes non-randomized studies, when your review includes a small number of key studies warranting deep assessment, or when you need domain-level risk of bias judgements for GRADE assessments. For a detailed walkthrough, see our ROBINS-I assessment guide.

Both tools have limitations. NOS has been criticized for inter-rater variability and lack of clear guidance on several items (Stang, 2010). ROBINS-I requires substantial training and can be time-prohibitive for large reviews. The best choice depends on your review scope, timeline, and the expectations of your target journal.

Using NOS Results in Your Systematic Review

NOS results are not an endpoint; they are an input to downstream analytical decisions. How you use quality assessment scores determines whether they add value to your systematic review or merely occupy a table in the appendix.

Sensitivity analysis by quality. The most important use of NOS scores is informing sensitivity analyses. Run your primary meta-analysis including all studies, then repeat it excluding studies classified as low quality (0-3 stars). If the pooled effect estimate changes substantially, your conclusions are sensitive to study quality, a critical finding that must be reported.
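The logic of such a sensitivity analysis can be sketched with inverse-variance fixed-effect pooling. All study data below are hypothetical, and a real review would typically use a dedicated package (such as metafor in R) and consider a random-effects model; this sketch only shows the exclusion-and-repool step.

```python
import math

def pool_fixed_effect(effects, variances):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return pooled, se

# Hypothetical log risk ratios, variances, and NOS totals for five studies
studies = [
    {"yi": 0.40, "vi": 0.04, "nos": 8},
    {"yi": 0.35, "vi": 0.05, "nos": 7},
    {"yi": 0.55, "vi": 0.06, "nos": 5},
    {"yi": 0.90, "vi": 0.10, "nos": 3},
    {"yi": 0.80, "vi": 0.09, "nos": 2},
]

all_pooled, _ = pool_fixed_effect([s["yi"] for s in studies],
                                  [s["vi"] for s in studies])
kept = [s for s in studies if s["nos"] >= 4]  # exclude low quality (0-3 stars)
sens_pooled, _ = pool_fixed_effect([s["yi"] for s in kept],
                                   [s["vi"] for s in kept])
print(f"all studies: {all_pooled:.3f}, excluding low quality: {sens_pooled:.3f}")
```

In this fabricated example the pooled estimate shrinks when low-quality studies are dropped, which is exactly the kind of quality-sensitivity you would need to report.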

Subgroup analysis. Present subgroup analyses stratified by NOS quality classification (high, moderate, low). This reveals whether effect estimates differ systematically by study quality. If high-quality studies show smaller effect sizes than low-quality studies, publication bias or methodological bias may be inflating the overall estimate.

Meta-regression. For reviews with sufficient studies (generally 10 or more), NOS total scores can serve as a covariate in meta-regression to test whether study quality predicts effect size. This provides a formal statistical test of the relationship between methodological rigor and study findings.
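The core computation can be illustrated with a weighted least-squares slope of effect size on NOS total. This is a sketch with hypothetical data; a real meta-regression also requires standard errors for the slope, an accounting for residual heterogeneity, and enough studies to be meaningful.

```python
def weighted_slope(x, y, w):
    """Slope of y regressed on x under weighted least squares (weights = 1/variance)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Hypothetical data: NOS totals, log effect sizes, inverse-variance weights
nos = [8, 7, 6, 5, 3, 2]
effect = [0.30, 0.35, 0.50, 0.55, 0.80, 0.90]
weight = [1 / v for v in (0.04, 0.05, 0.05, 0.06, 0.10, 0.09)]

slope = weighted_slope(nos, effect, weight)
print(f"slope per star: {slope:.3f}")  # negative here: lower quality, larger effects
```

A negative slope in this fabricated dataset means each additional NOS star is associated with a smaller effect estimate, the pattern that would suggest methodological bias inflates the pooled result.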

GRADE integration. When assessing certainty of evidence using the GRADE framework, NOS results directly inform the risk of bias domain. If most studies score below your pre-specified quality threshold, you may downgrade the certainty of evidence by one or two levels for risk of bias. Conversely, consistently high NOS scores support maintaining the evidence certainty level. For a thorough understanding of how quality assessment feeds into evidence certainty, consult our complete risk of bias guide.

Reporting. Present NOS results in a summary table showing each included study, its domain scores, and total score. Include the pre-specified quality thresholds and the criteria used for Comparability stars. Transparent reporting allows readers to evaluate your quality assessment decisions and, if they disagree with your thresholds, re-classify studies according to their own criteria.

Common Newcastle-Ottawa Scale Mistakes

Even experienced reviewers make errors when applying the Newcastle-Ottawa Scale. Recognizing these common mistakes helps you avoid them and produce a more defensible quality assessment.

Using the wrong version for the study design. Applying the cohort NOS to a case-control study, or vice versa, produces meaningless scores because the items evaluate different methodological features. Before scoring, confirm each study design and select the appropriate NOS version. For studies with ambiguous designs, classify the study design first using established criteria (such as those in the STROBE statement), then apply the matching NOS version.

Not pre-specifying Comparability criteria. The Comparability domain requires you to name the most important and second most important confounders before beginning assessment. Reviewers who skip this step often award Comparability stars inconsistently, giving credit for different confounders across studies based on what each study happened to control for. This introduces assessor bias into the quality assessment.

Treating NOS as a binary good/bad classification. NOS produces a 10-point scale (0-9 stars). Collapsing this into "high quality" versus "low quality" using a single threshold discards useful information. Always report domain-level scores alongside the total, and use multiple thresholds in sensitivity analyses rather than a single cut-point.

Awarding Comparability stars for crude analyses. A study that reports only crude (unadjusted) associations has not controlled for any confounders and should receive 0 Comparability stars, regardless of how well-designed the rest of the study appears. Some reviewers mistakenly award Comparability stars when the study mentions confounders in the Discussion without actually adjusting for them in the analysis.

Ignoring missing information. When a study does not report how exposure was ascertained, what the follow-up duration was, or what the attrition rate was, the study should not receive the corresponding star. Do not assume adequate methodology based on the study being published in a reputable journal. NOS scores should reflect what is documented, not what is presumed.

Inconsistent application across studies. Apply the same criteria to every study in your review. If you require a minimum 12-month follow-up for one cohort study, require it for all. Inconsistent application produces quality scores that reflect assessor variability rather than genuine quality differences between studies.

Failing to report inter-rater agreement. Two independent reviewers should assess each study, and you should report the level of agreement between them. High inter-rater agreement strengthens confidence in your quality assessment, while low agreement signals that your NOS criteria may need clarification. Calculate Cohen's kappa and report it in your methods section.

By avoiding these mistakes and following a structured, pre-specified approach, your NOS assessment will produce quality classifications that are transparent, reproducible, and defensible during peer review. For additional context on how quality assessment fits within the broader systematic review methodology, see our complete risk of bias guide.