Appraising Evidence for AI Medical Software

Written by Hatem Rabeh, MD, MSc Ing

Your Clinical Evaluation Expert and Partner

You have collected studies, validation reports, and performance data. Now you must appraise that evidence. For AI software, standard appraisal frameworks are not enough. You need criteria adapted to the specific characteristics of algorithmic evidence. Here is how to evaluate what you have.

Evidence appraisal for AI software evaluates three dimensions for every study and dataset: scientific validity, device relevance, and methodological quality. MEDDEV 2.7/1 Rev 4 establishes the framework. MDCG 2020-1 adds software-specific requirements. Your task is to apply both systematically.

The Three-Pillar Appraisal

Every piece of evidence must be evaluated against the three pillars from MDCG 2020-1:

Valid Clinical Association. Does the study demonstrate that the software output relates to the clinical condition? Look for correlation with accepted reference standards, clinical validation of the underlying relationship, and evidence that the measured variable matters for patient outcomes.

Analytical Performance. Does the study demonstrate that the software performs its technical function correctly? Look for verification and validation data, accuracy and precision under controlled conditions, and performance specifications with defined tolerances.

Clinical Performance. Does the study demonstrate that the software works in real clinical settings? Look for external validation, performance with actual users, and results that reflect intended use conditions.
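
For the performance pillars, point estimates alone are not enough; expect confidence intervals. Here is a minimal Python sketch, using hypothetical confusion-matrix counts, of the sensitivity and specificity reporting you should look for in the studies you appraise:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical confusion-matrix counts from a study's test set
tp, fn = 172, 28    # diseased cases: detected vs missed
tn, fp = 890, 110   # non-diseased cases: correctly ruled out vs false alarms

lo, hi = wilson_ci(tp, tp + fn)
print(f"Sensitivity {tp / (tp + fn):.3f} (95% CI {lo:.3f} to {hi:.3f})")
lo, hi = wilson_ci(tn, tn + fp)
print(f"Specificity {tn / (tn + fp):.3f} (95% CI {lo:.3f} to {hi:.3f})")
```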

Key Insight
A study may be excellent for one pillar and useless for another. Appraise each pillar separately. Do not let strong analytical performance obscure weak clinical validation.

Appraisal Criteria for AI Studies

For each study or dataset, evaluate these AI-specific factors:

Data Representativeness. Does the training and test data represent your intended population? Geographic diversity, demographic coverage, disease spectrum, image quality variation. Narrow data limits generalizability.

External Validation. Was the algorithm tested on truly independent data? Different sites, different time periods, different equipment. Internal validation alone is insufficient for clinical claims.
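
When a study shares per-case outputs with site identifiers, you can probe external validity yourself by recomputing discrimination per site rather than pooled. A sketch with synthetic stand-in data (the scores, labels, and site IDs are all hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: per-case predicted scores, labels, and site IDs
rng = np.random.default_rng(0)
scores = rng.uniform(size=600)
labels = (scores + rng.normal(0, 0.3, size=600) > 0.5).astype(int)
sites = rng.choice(["site_A", "site_B", "site_C"], size=600)

print(f"Pooled AUC: {roc_auc_score(labels, scores):.3f}")
for site in np.unique(sites):
    m = sites == site
    print(f"  {site}: AUC {roc_auc_score(labels[m], scores[m]):.3f} (n={m.sum()})")
# A marked drop at any one site suggests the pooled figure
# overstates cross-site generalizability.
```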

Reference Standard Quality. How was ground truth established? Expert consensus, pathology confirmation, clinical follow-up. Weak reference standards undermine even strong performance metrics.

Subgroup Analysis. Are results reported for clinically relevant subgroups? Age, sex, disease severity, comorbidities. Aggregate metrics can hide poor performance in specific populations.
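
A quick appraisal check: recompute sensitivity within each subgroup at the study's reported operating threshold. A sketch with hypothetical data, where a passable aggregate hides a weak older-age band:

```python
import numpy as np

# Hypothetical diseased-only test cases: binary predictions at the
# study's operating threshold, plus an age band per case
preds = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0])
age   = np.array(["<65"] * 6 + [">=65"] * 6)

print(f"Overall sensitivity: {preds.mean():.2f}")   # 0.67
for band in np.unique(age):
    sens = preds[age == band].mean()
    print(f"  {band}: sensitivity {sens:.2f}")      # 0.83 vs 0.50
```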

Calibration. If the software outputs probabilities, are they well-calibrated? A predicted 80% probability should be correct about 80% of the time. Miscalibration affects clinical decision-making.
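
When predicted probabilities and outcomes are available, a binned reliability check (a common technique, not mandated by any guidance) makes miscalibration visible. A sketch with synthetic data that is deliberately overconfident:

```python
import numpy as np

# Synthetic stand-in data: predicted probabilities and binary outcomes.
# Events occur at ~0.8x the predicted rate, i.e. the model is overconfident.
rng = np.random.default_rng(1)
probs = rng.uniform(size=2000)
outcomes = rng.binomial(1, 0.8 * probs)

edges = np.linspace(0.0, 1.0, 11)   # ten equal-width reliability bins
ece = 0.0
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (probs >= lo) & (probs < hi)
    if not in_bin.any():
        continue
    predicted = probs[in_bin].mean()     # mean predicted probability
    observed = outcomes[in_bin].mean()   # observed event rate
    ece += in_bin.mean() * abs(predicted - observed)
    print(f"[{lo:.1f}, {hi:.1f}): predicted {predicted:.2f} vs observed {observed:.2f}")
print(f"Expected calibration error: {ece:.3f}")
```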

Evidence Weighting

Not all evidence carries equal weight. Apply these principles:

Claim Relevance. Studies matching your claims precisely in population and environment receive highest weight. Studies with different populations or settings receive lower weight.

Design Quality. External validation ranks above internal testing. Prospective studies rank above retrospective for clinical performance claims. Controlled conditions rank above real-world for analytical claims.

Reporting Completeness. Complete case flows, clear reference standards, subgroup analysis, and calibration reporting strengthen weight. Missing information weakens it.
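
These principles can be operationalized as a scoring rubric. The dimensions, 1-to-3 levels, and cutoffs below are illustrative assumptions, not taken from MEDDEV or MDCG; define your own in your appraisal SOP:

```python
# Illustrative weighting rubric: each dimension scored 1 (weak) to 3 (strong).
# The dimensions and cutoffs are assumptions; fix yours in an appraisal SOP.
def overall_weight(claim_relevance: int, design_quality: int,
                   reporting_completeness: int) -> str:
    if min(claim_relevance, design_quality) == 1:
        return "low"       # a weak core dimension caps the overall weight
    if claim_relevance == 3 and design_quality == 3 and reporting_completeness >= 2:
        return "high"
    return "medium"

print(overall_weight(3, 3, 2))  # matched population, external validation: high
print(overall_weight(2, 3, 3))  # adjacent population: medium
print(overall_weight(3, 3, 1))  # strong study, opaque reporting: medium
```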

Red Flags
Absent external validation. Unclear reference standards. Performance only on ideal inputs. Missing calibration or subgroup results. Undefined workflow and oversight.

Decision Rules

Supported claims require at least one high-weight dataset meeting primary endpoints with no conflicting high-weight evidence.

Conditional claims rest solely on medium-weight datasets or lack coverage of key subgroups. These claims may proceed to market only with robust PMCF commitments.

Unsupported claims involve only low-weight evidence or critical safety endpoint failures. These claims cannot be made.
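
Expressed as code, the three rules might look like the following sketch; the `Dataset` fields and exact logic are my assumptions about how you encode your evidence:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    weight: str           # "high" | "medium" | "low", from your rubric
    endpoints_met: bool   # did it meet the claim's primary endpoints?
    safety_failure: bool  # did it fail a critical safety endpoint?

def claim_status(evidence: list[Dataset]) -> str:
    if any(d.safety_failure for d in evidence):
        return "unsupported"
    high = [d for d in evidence if d.weight == "high"]
    # Supported: >=1 high-weight success and no conflicting high-weight result
    if high and all(d.endpoints_met for d in high):
        return "supported"
    if any(d.weight == "medium" and d.endpoints_met for d in evidence):
        return "conditional"  # market entry only with robust PMCF commitments
    return "unsupported"

status = claim_status([Dataset("high", True, False), Dataset("medium", True, False)])
print(status)  # supported
```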

Document your appraisal in a structured matrix. For each study, record pillar relevance, design quality, data reporting quality, and overall weight. This creates the audit trail reviewers expect.
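
One lightweight way to maintain that matrix is a structured record per study, exported to CSV for the CER annex. The field names below are illustrative and mirror this post's criteria:

```python
import csv
from dataclasses import asdict, dataclass, fields

@dataclass
class AppraisalRow:
    study_id: str
    pillar: str             # "association" | "analytical" | "clinical"
    claim_relevance: int    # 1-3
    design_quality: int     # 1-3
    reporting_quality: int  # 1-3
    overall_weight: str     # output of your weighting rubric
    notes: str

rows = [
    AppraisalRow("STUDY-001", "clinical", 3, 3, 2, "high",
                 "Multi-site external validation; no calibration plot"),
    AppraisalRow("STUDY-002", "analytical", 3, 2, 3, "medium",
                 "Bench testing only; ideal-quality inputs"),
]

# Export the matrix as the audit trail annexed to the CER
with open("appraisal_matrix.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(AppraisalRow)])
    writer.writeheader()
    writer.writerows(asdict(r) for r in rows)
```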

In the next post, we cover how to structure your CER to present this evidence effectively.

Peace,
Hatem
Your Clinical Evaluation Partner

Frequently Asked Questions

How do I handle studies with mixed quality?

Evaluate each study dimension separately. A study may have strong methodology but weak relevance to your population. Document both strengths and limitations. Use the overall weight to determine how much the study contributes to your conclusions.

What if no high-weight evidence exists for a claim?

Consider whether the claim should be modified, supported with conditional language, or addressed through PMCF. Reviewers accept gap acknowledgment with mitigation plans. They do not accept unsupported claims presented as demonstrated.

Series: AI Medical Device Clinical Evaluation

Part 4 of 6

Coming Soon

Structuring the CER for AI Medical Software

Need Expert Help with Your Clinical Evaluation?

Get personalized guidance on MDR compliance, CER writing, and Notified Body preparation.

Follow me for more insights and practical advice.

References:
– MEDDEV 2.7/1 Rev 4, Clinical Evaluation: A Guide for Manufacturers and Notified Bodies
– MDCG 2020-1: Guidance on Clinical Evaluation (MDR) / Performance Evaluation (IVDR) of Medical Device Software
– TRIPOD+AI: Reporting Guideline for Clinical Prediction Models Using Regression or Machine Learning Methods