Why most literature appraisal scores get challenged by reviewers
I reviewed a clinical evaluation last month where every study had been scored. The spreadsheet looked professional. Columns filled with numbers. Weighted averages at the bottom. The Notified Body rejected the entire appraisal section. Not because the studies were wrong. Because the scoring created a false sense of objectivity that hid fundamental problems with study relevance and bias.
In This Article
- What scoring systems promise versus what they deliver
- The core issue reviewers identify
- What relevance actually means in MDR context
- The bias assessment trap
- When scoring obscures gaps in the evidence base
- What reviewers want to see instead
- The role of scores in internal processes versus submission
- How this affects your timeline and workload
- When scores can be useful
- What this means for your next clinical evaluation
I see this pattern in roughly every second clinical evaluation I audit. Teams invest significant effort building scoring systems. They apply numerical scales to study design, sample size, relevance, bias. They calculate totals. They rank studies. They set acceptance thresholds.
Then reviewers challenge the entire approach.
The problem is not that scoring is inherently wrong. The problem is that most scoring systems answer the wrong questions. They create a veneer of rigor while obscuring what actually matters for regulatory decision-making under MDR.
What scoring systems promise versus what they deliver
Scoring systems promise objectivity. They promise reproducibility. They promise a structured way to compare studies and justify inclusion or exclusion decisions.
What they actually deliver is often different.
Most scoring templates assess generic quality dimensions. Study design hierarchy. Sample size adequacy. Statistical power. Blinding. Randomization. Follow-up duration. These are borrowed from evidence-based medicine frameworks designed for therapeutic interventions.
But clinical evaluations under MDR are not systematic reviews for treatment guidelines. The questions are different. The context is different. The relevance criteria must be different.
A common mistake is using generic quality appraisal tools such as the Newcastle-Ottawa Scale or the Cochrane Risk of Bias tool without adapting the criteria to device-specific relevance, intended use, and risk profile. Reviewers see numbers but no device-specific justification.
I see this constantly. A study gets high scores because it is a randomized controlled trial with low risk of bias. But the device used in that study differs from yours in material composition. Or the patient population differs significantly. Or the clinical endpoints measured do not align with the claimed benefits of the device under evaluation.
The score says the study is high quality. But the study may still be irrelevant or insufficient for demonstrating safety and performance under Article 61 of MDR.
The core issue reviewers identify
Notified Bodies and competent authorities are trained to evaluate clinical evidence from a regulatory perspective. Their primary question is not whether a study is methodologically sound in general terms.
Their primary question is whether this specific study provides valid evidence for this specific device in this specific intended use with this specific patient population addressing these specific risks and benefits.
Generic scoring systems do not answer that question.
What happens in practice is that manufacturers score studies on dimensions that seem scientific. Then they present aggregated scores as if higher numbers mean better evidence for regulatory sufficiency.
Reviewers see through this immediately.
They ask: Why is sample size weighted equally for a low-risk device versus a high-risk device? Why is randomization scored highly when the comparison group is irrelevant to your device? Why does this study score 85 out of 100 but fail to address the key residual risk identified in your risk management file?
Reviewers do not reject scoring because they dislike numbers. They reject it because numerical scores often mask the absence of device-specific critical appraisal and relevance justification.
What relevance actually means in MDR context
MDCG 2020-5 and MDCG 2020-6 emphasize that clinical data must be relevant to the device under evaluation. Relevance is not a single score. It is a multi-dimensional assessment.
Does the study device match yours in technical characteristics? Does it match in biological characteristics if applicable? Does the intended purpose align? Does the patient population reflect your target population or can differences be justified? Do the clinical endpoints address the claims and the risks?
Each of these dimensions requires explicit justification. A scoring system that assigns 3 points for “similar patient population” does not explain how you determined similarity. It does not justify why age range differences are acceptable. It does not address comorbidity patterns or severity staging.
Reviewers expect narrative justification with references to your Technical Documentation and Risk Management File. They expect you to explain your reasoning. A number does not explain reasoning.
The bias assessment trap
Another area where scoring systems create problems is bias assessment.
Standard tools assess selection bias, performance bias, detection bias, attrition bias, reporting bias. These are important in clinical research. But in regulatory evaluation of medical device literature, other bias dimensions often matter more.
Commercial bias. Was the study funded by a competitor with motivation to show negative results or by the manufacturer with motivation to show positive results? Publication bias. Are there unpublished studies with less favorable outcomes? Applicability bias. Was the study conducted in a controlled academic center that does not reflect real-world use conditions?
I have seen clinical evaluations where studies scored low on traditional bias scales but were actually the most relevant evidence available. And I have seen studies score high but carry significant commercial or applicability bias that was never addressed.
Scoring templates rarely capture these device-specific and regulatory-specific bias dimensions.
The same trap appears when teams apply bias assessment tools designed for pharmaceutical trials without considering device-specific biases such as learning curve effects, operator dependence, or institutional volume effects.
When scoring obscures gaps in the evidence base
Perhaps the most serious problem with scoring systems is that they can create false confidence.
You appraise twenty studies. Fifteen score above your acceptance threshold. Five score below and are excluded. You present a table showing that the included studies have an average quality score of 82 percent.
This looks rigorous. But it may hide fundamental gaps.
Do those fifteen studies actually cover all the significant residual risks identified in your risk management file? Do they address the specific patient subgroups in your intended use? Do they provide data on long-term performance if your device is intended for chronic use?
Scoring individual studies does not assess the completeness and sufficiency of the entire evidence base. That requires a different type of analysis. One that maps evidence to risks, to claims, to intended use, and to gaps.
Reviewers perform this mapping. When they do, they often find that despite high individual study scores, the evidence base has critical gaps. The manufacturer missed this because they focused on scoring studies rather than mapping evidence to regulatory requirements.
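To make the mapping idea concrete, here is a minimal sketch of how such a coverage check could work as an internal exercise. The risk identifiers, claim identifiers, study names, and coverage assignments are hypothetical placeholders, not a prescribed method:

```python
# Illustrative sketch only: map each included study to the residual risks and
# claims it actually addresses, then surface what remains uncovered.
# All identifiers below are hypothetical examples.

# Residual risks from the risk management file and claims from the CER
# that require clinical evidence.
required_items = {
    "RISK-07: delayed wound healing",
    "RISK-12: device migration beyond 12 months",
    "CLAIM-02: pain reduction at 6 months",
}

# What each included study actually covers, regardless of its quality score.
study_coverage = {
    "Smith 2021": {"CLAIM-02: pain reduction at 6 months"},
    "Lee 2019": {"RISK-07: delayed wound healing"},
    "Chen 2022": {"CLAIM-02: pain reduction at 6 months"},
}

covered = set().union(*study_coverage.values())
gaps = required_items - covered

print("Covered:", sorted(covered))
print("Gaps requiring additional evidence or PMCF:", sorted(gaps))
# In this example, RISK-12 (long-term migration) surfaces as a gap,
# even if every included study scored 80+ on a generic quality scale.
```

The point is not the tooling. The point is that coverage of risks and claims, not the average quality score, is what determines sufficiency.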
What reviewers want to see instead
Reviewers want structured critical appraisal. But structure does not require numerical scoring.
What works better in my experience is a narrative appraisal for each study that addresses specific dimensions:
Relevance justification: Explicit comparison of study device technical and biological characteristics to your device. Explicit comparison of patient population. Explicit comparison of intended use and clinical context. Explicit justification of why any differences are acceptable or how you account for them.
Methodological assessment: Description of study design. Identification of strengths and limitations relevant to the regulatory question being asked. Assessment of whether the endpoints measured align with your claimed benefits and identified risks.
Bias and applicability: Discussion of potential biases including commercial interests, publication status, and real-world applicability. Discussion of whether results are generalizable to your intended use conditions.
Weight of evidence: Instead of a numerical score, a qualitative statement about what this study contributes to your overall evidence base and what it does not address.
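If you want to keep this structure consistent across studies, a simple internal record per study can enforce that every dimension is addressed in prose. This is an illustrative sketch only; the field names and example content are assumptions, and each field holds reasoning, not a number:

```python
from dataclasses import dataclass

# Illustrative internal structure for one narrative appraisal record.
# Every field carries reasoning in prose; nothing is reduced to a score.
@dataclass
class StudyAppraisal:
    citation: str
    relevance_justification: str    # device, population, intended use comparison
    methodological_assessment: str  # design, strengths, limitations vs. the regulatory question
    bias_and_applicability: str     # commercial interests, publication status, real-world use
    weight_of_evidence: str         # what the study contributes and what it does not address

appraisal = StudyAppraisal(
    citation="Smith 2021 (hypothetical example)",
    relevance_justification=(
        "Same polymer and coating as the subject device; population older "
        "(mean 68 vs. 61 years), difference justified via subgroup data."
    ),
    methodological_assessment=(
        "Prospective single-arm cohort, n=112; endpoints align with the claimed "
        "pain reduction but not with the migration risk."
    ),
    bias_and_applicability=(
        "Manufacturer-funded; single academic center, so real-world "
        "applicability is limited and discussed explicitly."
    ),
    weight_of_evidence=(
        "Supports the pain reduction claim at 6 months; contributes nothing to "
        "long-term migration, which remains a gap for PMCF."
    ),
)
```

The structure is only scaffolding. The reasoning inside each field is what reviewers evaluate.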
This approach requires more thinking. It requires understanding your device. It requires understanding the regulatory requirements. But it produces appraisals that withstand review because the reasoning is transparent.
The goal of literature appraisal in clinical evaluation is not to rank studies. It is to demonstrate that you understand what each piece of evidence contributes to demonstrating safety and performance, and what it does not contribute.
The role of scores in internal processes versus submission
Some teams use scoring systems internally during initial screening. This can be useful for efficiency when reviewing large volumes of literature. You need some way to prioritize which studies to appraise in detail first.
That is fine for internal workflow.
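For completeness, here is roughly what an internal screening score might look like when the only goal is to prioritize which studies get full narrative appraisal first. The criteria and weights are arbitrary examples, and the output stays out of the submitted report:

```python
# Internal prioritization only, never submitted: a rough screening score
# used to decide the order of full narrative appraisal.
# Criteria and weights below are arbitrary examples.
screening_weights = {
    "same_device_or_equivalent": 3,
    "target_population_overlap": 2,
    "endpoints_match_claims": 2,
    "full_text_available": 1,
}

def screening_score(flags: dict) -> int:
    """Sum the weights of the criteria a study meets (True/False flags)."""
    return sum(w for criterion, w in screening_weights.items() if flags.get(criterion))

candidates = {
    "Smith 2021": {"same_device_or_equivalent": True, "target_population_overlap": True,
                   "endpoints_match_claims": True, "full_text_available": True},
    "Doe 2018": {"same_device_or_equivalent": False, "target_population_overlap": True,
                 "endpoints_match_claims": False, "full_text_available": True},
}

# Higher scores are appraised in detail first; nothing is excluded on this basis alone.
for study, flags in sorted(candidates.items(), key=lambda kv: -screening_score(kv[1])):
    print(study, screening_score(flags))
```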
The mistake is including these internal screening scores in the clinical evaluation report submitted to the Notified Body. Once you include scores in the submission, reviewers will scrutinize the scoring methodology. They will ask why you chose those criteria. They will ask why you weighted them that way. They will ask how the scores translate to regulatory sufficiency.
If you cannot answer those questions convincingly, the scores become a liability rather than an asset.
In my practice, I recommend keeping scoring as an internal tool only. In the clinical evaluation report, present structured narrative appraisals that transparently explain the relevance and limitations of each study in the context of your specific device and regulatory requirements.
How this affects your timeline and workload
I am often asked whether narrative appraisal takes more time than scoring.
Initially, yes. Writing clear justifications for relevance and applicability requires thought. It requires understanding your Technical Documentation. It requires mapping to your Risk Management File.
But in the review cycle, narrative appraisal saves time.
Scoring systems generate deficiencies. You receive questions challenging your methodology. You receive questions about individual scores. You receive questions about why certain studies were included or excluded based on arbitrary thresholds.
Each deficiency response cycle adds weeks or months to your timeline. Each round of clarification reveals that the scoring system did not actually address the regulatory questions.
Narrative appraisal done correctly the first time reduces these iterations. Reviewers can follow your reasoning. They can agree or disagree with specific judgments, but they understand the basis for those judgments. This leads to more focused deficiencies about evidence gaps rather than methodological debates about scoring.
When scores can be useful
I do not want to suggest that numerical assessment has no place in clinical evaluation.
There are situations where semi-quantitative approaches add value. For example, when comparing multiple similar studies to identify which provides the most relevant data for a specific endpoint. Or when tracking literature over time in post-market clinical follow-up to identify trends in evidence quality.
But even in these cases, the numbers should support the narrative, not replace it.
And the scoring criteria must be explicitly tailored to device-specific relevance and regulatory requirements. Not borrowed uncritically from generic appraisal tools designed for different purposes.
If you use scoring, the methodology and justification for the scoring system must be as robust as the appraisal itself. If you cannot defend the scoring system under scrutiny, do not include it in your submission.
What this means for your next clinical evaluation
If you are currently using scoring systems in your clinical evaluations, review them critically.
Ask yourself: Does this score answer the question a reviewer will ask? Does it demonstrate relevance to my specific device? Does it justify why this study contributes to demonstrating safety and performance under MDR?
If the answer is no, consider whether the scoring adds value or simply adds risk of deficiencies.
In most cases, a well-structured narrative appraisal that explicitly addresses relevance, methodology, bias, and contribution to the evidence base will be stronger and more defensible than numerical scores.
This does not mean less rigor. It means rigor applied to the right questions. The questions that regulators and Notified Bodies actually need answered to fulfill their obligations under Articles 52 and 61 of MDR.
Your clinical evaluation is not an academic exercise. It is a regulatory document that must demonstrate sufficient clinical evidence for the safety and performance of your device. Keep that purpose central to how you appraise literature.
And remember that no scoring system can replace clear thinking about what evidence you need, what evidence you have, and what gaps remain.
Peace,
Hatem
Clinical Evaluation Expert for Medical Devices
Follow me for more insights and practical advice.
Frequently Asked Questions
What is a Clinical Evaluation Report (CER)?
A CER is a mandatory document under MDR 2017/745 that demonstrates the safety and performance of a medical device through systematic analysis of clinical data. It must be updated throughout the device lifecycle based on PMCF findings.
How often should the CER be updated?
The CER should be updated whenever significant new clinical data becomes available, after PMCF activities, when there are changes to the device or intended purpose, and at minimum during annual reviews as part of post-market surveillance.
What causes CER rejection by Notified Bodies?
Common reasons include inadequate equivalence demonstration, insufficient clinical data for claims, poorly structured SOTA analysis, missing gap analysis, and lack of clear benefit-risk determination. Structure and logical flow are as important as the data itself.
Which MDCG guidance documents are most relevant for clinical evaluation?
Key documents include MDCG 2020-5 (Equivalence), MDCG 2020-6 (Sufficient Clinical Evidence), MDCG 2020-13 (CEAR Template), MDCG 2020-7 (PMCF Plan), and MDCG 2020-8 (PMCF Evaluation Report).
– Regulation (EU) 2017/745 Article 61 (Clinical Evaluation)
– MDCG 2020-5 Clinical evaluation – Equivalence
– MDCG 2020-6 Sufficient clinical evidence for legacy devices
– MDCG 2020-13 Clinical evaluation assessment report template





