AI Software as Medical Device: Clinical Evaluation Under MDR

Written by HATEM RABEH, MD, MSc Ing

Your Clinical Evaluation Expert And Partner

A machine learning algorithm receives CE mark approval. Six months later, the manufacturer updates the training dataset. The clinical evaluation isn’t revised. The Notified Body issues a nonconformity during surveillance. The manufacturer argues the device is unchanged. The argument fails.

This scenario repeats across the industry. AI-based medical devices present clinical evaluation challenges that traditional devices never created. The software doesn’t change physically, but its clinical behavior can drift. The training data evolves. The performance shifts with real-world population variance.

Manufacturers assume that because the algorithm architecture stays constant, the clinical evaluation remains valid. This assumption collapses under MDR scrutiny.

Why SaMD Clinical Evaluation Differs From Hardware Devices

The MDR doesn’t distinguish between software and hardware in its fundamental requirements. Annex XIV still demands demonstration of safety and performance. Article 61 still requires clinical data to support conformity.

But the nature of software evidence creates friction points that hardware manufacturers never encounter.

Traditional devices have fixed characteristics. A hip implant maintains its material properties. A surgical instrument doesn’t alter its mechanical advantage between patients. The clinical evaluation addresses a stable entity.

AI medical software operates differently. The same algorithm version may perform variably across subpopulations. Performance metrics depend on input data quality. Edge cases emerge in deployment that weren’t present in validation datasets.

Key Insight
The clinical evaluation for AI SaMD must address not just what the software does, but how its performance varies across the intended use population and what happens when it encounters data outside its training distribution.

Most manufacturers build their clinical evaluation as if software were static. They demonstrate performance on validation datasets. They reference publications on similar algorithms. They submit this to the Notified Body expecting approval.

The deficiency notice arrives because the evaluation never addressed clinical behavior under real-world variability.

The Clinical Data Problem for AI Medical Devices

Article 61(1) requires sufficient clinical data. For traditional devices, this data comes from clinical investigations or literature on equivalent devices.

AI software complicates both pathways.

Clinical investigations on software rarely follow traditional trial structures. There’s no implantation. No surgical procedure. Often no direct patient contact. The software assists clinician decision-making or automates image analysis.

How do you design a clinical investigation that isolates the software’s contribution to clinical outcomes?

A diagnostic algorithm that detects diabetic retinopathy provides output to the clinician. The clinician makes treatment decisions. If patient outcomes improve, was it the algorithm’s accuracy or the clinician’s response to the information?

Literature searches face different challenges. Publications on machine learning models often focus on technical performance metrics. Sensitivity, specificity, AUC values on test datasets. These metrics don’t directly translate to clinical benefit demonstration under MDR.

Common Deficiency
Clinical evaluations that rely exclusively on algorithm performance metrics without connecting these metrics to clinical outcomes or patient benefit. Notified Bodies consistently reject this approach as insufficient clinical data under Article 61.

The clinical evaluation must explain the chain of evidence. The algorithm achieves X sensitivity on the validation dataset. This sensitivity level enables clinicians to detect Y condition at Z stage. Detection at Z stage leads to treatment initiation that produces measurable clinical benefit.

Without this chain, you have technical validation, not clinical evidence.
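To make the distinction concrete, here is a minimal sketch of the technical validation step on its own, assuming a hypothetical binary classifier, made-up labels and scores, and scikit-learn. It produces exactly the metrics that publications typically report, and no clinical evidence.

```python
# Minimal sketch with made-up labels and scores: the technical validation
# metrics that publications typically report, and nothing more.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])                         # reference standard
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3, 0.55, 0.05])  # model outputs
y_pred = (y_score >= 0.5).astype(int)                                     # chosen operating point

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, y_score)
print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}, AUC {auc:.2f}")

# These figures are the start of the chain of evidence, not its end: the
# evaluation still has to link them to detection stage, treatment decisions,
# and measurable patient benefit.
```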

Equivalence Claims for AI Software

Some manufacturers attempt the equivalence route. They identify clinically similar devices and set out to demonstrate clinical, technical, and biological equivalence per MDCG 2020-5.

This strategy fails more often than it succeeds for AI medical software.

Technical equivalence requires similar technical characteristics. For software, this includes algorithm architecture, training methodology, input data specifications, and output characteristics.

Two neural networks that both analyze chest X-rays appear similar. But if one was trained on Asian populations and another on European populations, their performance characteristics differ. If one uses a convolutional architecture and another uses a transformer model, their behavior under edge cases diverges.

The technical similarity that looks obvious at surface level disappears under detailed analysis.

Biological equivalence compounds the problem. Software doesn’t contact tissue. But it influences clinical decisions that lead to biological consequences. The biological equivalence assessment must address the clinical pathway, not just the device function.

Software that recommends biopsy based on image analysis creates different biological risks than software that adjusts insulin pump delivery rates. Both are software. Neither is biologically equivalent to the other.

Key Insight
Equivalence claims for AI SaMD typically fail because manufacturers focus on functional similarity rather than clinical behavior similarity. The evaluation must demonstrate that the devices behave equivalently across the full intended use population under real-world conditions.

The SOTA Challenge for Machine Learning Models

The state of the art analysis under Annex I Section 1 creates particular challenges for AI medical devices.

Medical AI evolves rapidly. New architectures emerge. Training techniques improve. Benchmark datasets grow. What represented state of the art eighteen months ago during development may be outdated by the time of technical documentation submission.

The SOTA analysis must address this temporal gap honestly.

I’ve reviewed evaluations that cite papers from five years prior as current state of the art. The manufacturer developed the device three years ago using methods that were reasonable then. But the SOTA section doesn’t acknowledge that newer approaches exist.

The Notified Body reviewer finds recent publications showing improved performance with different methods. The deficiency notice questions why the manufacturer didn’t adopt these approaches.

The manufacturer’s defense that their method was SOTA during development doesn’t satisfy the requirement. Annex I requires considering SOTA at the time of placing on the market, not at the time of initial development.

This creates a difficult position. The manufacturer cannot completely redevelop the algorithm every time a new paper appears. But the clinical evaluation must acknowledge current SOTA and justify why the device approach remains acceptable.

Common Deficiency
SOTA analyses that stop at describing the device’s technical approach without comparing it to current alternatives. The evaluation must explicitly address why the chosen approach remains appropriate given recent advances, or demonstrate that recent advances don’t offer clinically meaningful improvements for the intended use.

Version Control and Clinical Evaluation Maintenance

Software versions create clinical evaluation maintenance requirements that hardware manufacturers don’t face.

A software update that changes algorithm parameters can change clinical performance, and the clinical evaluation may need revision. Yet manufacturers often treat minor version updates as purely technical changes that don't require clinical reevaluation.

The line between significant and non-significant changes isn’t always clear for AI software.

Consider a model that’s retrained on an expanded dataset. The architecture doesn’t change. The intended use doesn’t change. The manufacturer’s testing shows similar performance metrics. Does this require clinical evaluation update?

Under MDR, the answer depends on whether the change could affect clinical safety or performance. For AI models, retraining on different data distributions can alter behavior on edge cases even when aggregate metrics stay similar.

The conservative approach treats any retraining as potentially significant. But this creates practical problems. Some AI medical devices include adaptive learning components. Treating every adaptation as a significant change makes the device unworkable under MDR.

The resolution requires clear specification in the technical documentation of what changes trigger clinical evaluation review. This specification must be risk-based and justified. The PMCF plan must include monitoring for performance drift that might indicate the clinical evaluation assumptions no longer hold.
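As one way to operationalize such a specification, here is a minimal sketch, using hypothetical data, invented variable names, and an illustrative tolerance, that compares a retrained model against the released version per subgroup rather than only in aggregate.

```python
# Minimal sketch, hypothetical data and an illustrative tolerance: comparing
# a retrained model against the released version per subgroup, not only in
# aggregate, on a fixed held-out evaluation set never used for retraining.
import numpy as np
from sklearn.metrics import recall_score  # recall of the positive class equals sensitivity

def sensitivity_by_subgroup(y_true, y_pred, subgroup):
    """Sensitivity per subgroup label (e.g. age band, site, scanner type)."""
    return {
        g: recall_score(y_true[subgroup == g], y_pred[subgroup == g])
        for g in np.unique(subgroup)
    }

def flag_changes(before, after, tolerance=0.02):
    """Subgroups whose sensitivity moved by more than the pre-specified tolerance."""
    return {g: (before[g], after[g]) for g in before if abs(before[g] - after[g]) > tolerance}

# y_true, pred_released, pred_retrained and subgroup would come from the
# manufacturer's locked evaluation set:
# flagged = flag_changes(
#     sensitivity_by_subgroup(y_true, pred_released, subgroup),
#     sensitivity_by_subgroup(y_true, pred_retrained, subgroup),
# )
# Any flagged subgroup feeds the documented change assessment, and potentially
# a clinical evaluation update, rather than an automatic pass.
```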

PMCF Requirements for AI Medical Devices

Post-market clinical follow-up becomes especially critical for AI SaMD.

The pre-market clinical evaluation validates performance under controlled conditions. PMCF monitors whether this performance translates to real-world use across the full intended use population.

For AI software, this monitoring must address several specific risks.

First, performance on populations different from training data. The validation dataset may not represent all patient subgroups in the intended use. PMCF should monitor performance across demographic and clinical variables to detect unexpected performance degradation.

Second, user interaction effects. Clinicians may use the software in ways not anticipated during development. They may override recommendations based on factors the algorithm doesn’t consider. PMCF should capture these interaction patterns and assess whether they indicate safety concerns.

Third, data drift. The characteristics of clinical data change over time. Imaging equipment improves. Clinical protocols evolve. Patient populations shift. The AI model trained on historical data may encounter input distributions that differ from training data.

Key Insight
The PMCF plan for AI SaMD must include specific metrics for detecting performance drift, subpopulation performance variation, and real-world usage patterns that differ from intended use. Generic PMCF plans that don’t address these software-specific risks consistently receive deficiency notices.

Many PMCF plans I review include standard templates. They commit to literature monitoring and complaint analysis. These elements are necessary but insufficient for AI medical devices.

The plan needs specific performance monitoring with defined thresholds that trigger investigation. If sensitivity drops below X percent in subpopulation Y, what action does the manufacturer take? If user override rates exceed Z percent, how does this feed back into risk management?

Without these specifics, the PMCF plan doesn’t fulfill its MDR purpose.
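As an illustration only, with invented figures and subgroup names, here is what explicit PMCF triggers can look like once they are written down instead of left generic.

```python
# Illustration only, with invented figures and subgroup names: PMCF triggers
# written down as explicit thresholds and actions rather than left generic.
PMCF_TRIGGERS = {
    "sensitivity_floor_by_subgroup": {"age_over_75": 0.88, "community_sites": 0.90},
    "override_rate_ceiling": 0.15,              # clinicians rejecting the algorithm output
    "out_of_distribution_rate_ceiling": 0.05,   # inputs flagged as unlike the training data
}

def evaluate_pmcf_window(observed):
    """Return the actions triggered by one monitoring window."""
    actions = []
    for group, floor in PMCF_TRIGGERS["sensitivity_floor_by_subgroup"].items():
        if observed["sensitivity_by_subgroup"].get(group, 1.0) < floor:
            actions.append(f"Investigate sensitivity drop in {group}")
    if observed["override_rate"] > PMCF_TRIGGERS["override_rate_ceiling"]:
        actions.append("Feed override pattern back into risk management")
    if observed["ood_rate"] > PMCF_TRIGGERS["out_of_distribution_rate_ceiling"]:
        actions.append("Review out-of-distribution inputs against the intended use")
    return actions

window = {  # one hypothetical quarterly monitoring window
    "sensitivity_by_subgroup": {"age_over_75": 0.85, "community_sites": 0.92},
    "override_rate": 0.18,
    "ood_rate": 0.02,
}
print(evaluate_pmcf_window(window))
```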

The Validation Dataset Problem

Clinical evaluations for AI SaMD typically rely heavily on validation dataset results. The manufacturer demonstrates that the algorithm achieves acceptable performance metrics on a test dataset not used during training.

Notified Body reviewers consistently question validation dataset representativeness.

Was the dataset collected from sites similar to the intended use environment? Does it include the full range of patient characteristics in the intended use population? Were exclusion criteria applied that might limit generalizability?

A validation dataset from academic medical centers may not represent community hospital populations. A dataset from one geographic region may not represent global use. These limitations must be acknowledged and addressed in the clinical evaluation.

The evaluation should explicitly state validation dataset characteristics. It should compare these characteristics to the intended use population. Where gaps exist, it should explain how PMCF will monitor performance in underrepresented groups.
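Here is a minimal sketch of that comparison, with invented proportions and an illustrative gap threshold, showing how the representativeness analysis can be made explicit rather than narrative.

```python
# Minimal sketch, invented proportions: making the representativeness gap
# between the validation dataset and the intended use population explicit.
intended_population = {"age_over_65": 0.40, "female": 0.52, "community_hospital": 0.60}
validation_dataset = {"age_over_65": 0.22, "female": 0.48, "community_hospital": 0.05}

GAP_THRESHOLD = 0.10  # illustrative; any real threshold needs its own justification

for characteristic, target in intended_population.items():
    observed = validation_dataset[characteristic]
    flagged = abs(target - observed) > GAP_THRESHOLD
    status = "gap to address in the evaluation and PMCF" if flagged else "acceptable"
    print(f"{characteristic}: intended {target:.0%} vs validation {observed:.0%} -> {status}")
```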

Common Deficiency
Validation results presented without sufficient description of dataset characteristics, collection methods, or representativeness analysis. Reviewers cannot assess whether the validation data supports the intended use claims without this contextual information.

Black Box Algorithms and Clinical Evaluation

Deep learning models often function as black boxes. The algorithm makes predictions, but the decision pathway isn’t interpretable in clinical terms.

This creates tension with clinical evaluation requirements. The evaluation must explain the device’s mechanism of action. For traditional devices, this means material properties, mechanical function, or pharmacological effects.

For black box AI, what constitutes explanation of mechanism?

Some manufacturers describe the algorithm architecture. They explain the neural network structure, activation functions, and training process. This technical description doesn’t satisfy clinical evaluation requirements.

The clinical evaluation needs to explain what clinical information the algorithm uses and how this information relates to the clinical output. Even if the internal processing isn’t interpretable, the input-output relationship must make clinical sense.

A diagnostic algorithm that detects pneumonia from chest X-rays should identify which image features correlate with diagnostic decisions. Saliency maps or attention mechanisms can demonstrate that the algorithm focuses on clinically relevant image regions rather than artifacts or metadata.

Without this level of explanation, reviewers cannot assess whether the algorithm makes decisions for clinically sound reasons or whether it exploits spurious correlations in training data.
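For illustration, here is a minimal sketch of plain gradient saliency for a hypothetical PyTorch classifier. Real submissions often rely on more robust attribution methods, but the principle is the same: show that the model's attention concentrates on clinically relevant regions.

```python
# Minimal sketch, assuming a hypothetical PyTorch classifier `model` whose
# output column 0 is the pneumonia score for a (1, C, H, W) input tensor.
# Plain gradient saliency: which pixels most influence the prediction?
import torch

def gradient_saliency(model, image):
    """Return an (H, W) map of per-pixel influence on the model's output."""
    model.eval()
    image = image.clone().requires_grad_(True)   # leaf tensor so gradients are retained
    score = model(image)[0, 0]                   # scalar score for the positive class
    score.backward()
    # Take the maximum absolute gradient across colour channels per pixel.
    return image.grad.detach().abs().max(dim=1).values.squeeze(0)

# saliency = gradient_saliency(model, chest_xray_tensor)
# Reviewers look for influence concentrated on the lung fields, not on
# laterality markers, burned-in annotations, or other acquisition artefacts.
```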

Risk Management Integration

ISO 14971 risk management for AI software must integrate tightly with clinical evaluation.

The risk analysis should identify failure modes specific to AI behavior. These include false negatives, false positives, performance degradation on atypical cases, and unexpected behavior on out-of-distribution data.

The clinical evaluation must provide evidence that these risks are acceptable given the clinical benefits. This requires quantifying the clinical consequences of different failure modes.

A false negative in a cancer screening algorithm has different clinical consequences than a false negative in a scheduling optimization algorithm. The clinical evaluation must address these consequences explicitly.
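A small worked example, with purely illustrative numbers, shows how failure modes can be expressed in clinical terms per 10,000 screened patients so the benefit-risk discussion can weigh them explicitly.

```python
# Worked example with purely illustrative numbers: translating false negatives
# and false positives into expected events per 10,000 screened patients.
PER_10K = 10_000
prevalence = 0.02     # assumed condition prevalence in the intended use population
sensitivity = 0.92    # assumed algorithm performance at the chosen operating point
specificity = 0.90

positives = prevalence * PER_10K
negatives = PER_10K - positives

missed_cases = positives * (1 - sensitivity)   # false negatives: delayed diagnosis
false_alarms = negatives * (1 - specificity)   # false positives: unnecessary work-up

print(f"Per 10,000 screened: about {missed_cases:.0f} missed cases "
      f"and {false_alarms:.0f} false alarms")
# The clinical evaluation then attaches a clinical consequence to each figure
# and argues why the balance against the demonstrated benefit is acceptable.
```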

Risk control measures for AI software often involve user interface design, user training, and clinical workflow integration. The clinical evaluation should verify that these measures effectively reduce risk to acceptable levels.

If the risk control relies on clinician oversight of algorithm outputs, the evaluation must demonstrate that clinicians can effectively identify algorithm errors. If they can’t, the risk control doesn’t function as assumed.

What Actually Satisfies Notified Bodies

After reviewing dozens of AI SaMD submissions and the deficiency notices that followed, I see clear patterns in what satisfies reviewers.

Comprehensive validation that addresses representativeness. Not just performance metrics, but analysis of performance across patient subgroups. Discussion of validation dataset limitations and how these affect generalizability claims.

Clear explanation of clinical pathways. How algorithm outputs lead to clinical decisions. What clinical benefit results from these decisions. Evidence linking algorithm performance to clinical outcomes.

Explicit SOTA comparison. Not just description of the device approach, but comparison to alternative approaches. Justification for design choices based on clinical requirements rather than just technical considerations.

Specific PMCF commitments. Defined metrics, monitoring methods, and action thresholds. Plans that address software-specific risks like performance drift and subpopulation variation.

Risk-benefit analysis that acknowledges AI-specific risks honestly and demonstrates proportionate benefits.

The clinical evaluations that proceed smoothly through review don’t hide behind technical complexity. They translate technical characteristics into clinical language. They acknowledge uncertainties and explain how post-market monitoring will address them.

Looking Forward

The regulatory landscape for AI medical devices continues evolving. MDCG guidance specific to AI is in development. Notified Body reviewers are becoming more sophisticated in their understanding of machine learning.

Manufacturers who approach clinical evaluation for AI SaMD as a checkbox exercise will continue receiving deficiency notices. Those who recognize that the clinical evaluation must address the unique characteristics of AI behavior will find the path clearer.

The fundamental principle remains unchanged. The clinical evaluation must demonstrate that the device is safe and performs as intended across the full scope of intended use. For AI medical software, this requires evidence types and analysis approaches that differ from traditional devices.

But the requirement itself isn’t different. It’s clinical evaluation under MDR, applied to a technology that challenges some of our traditional evidence frameworks.

The manufacturers succeeding in this space are those who acknowledge this challenge directly and build their clinical evidence strategy accordingly.

Peace,
Hatem
Clinical Evaluation Expert for Medical Devices
Follow me for more insights and practical advice.

Frequently Asked Questions

What is a Clinical Evaluation Report (CER)?

A CER is a mandatory document under MDR 2017/745 that demonstrates the safety and performance of a medical device through systematic analysis of clinical data. It must be updated throughout the device lifecycle based on PMCF findings.

How often should the CER be updated?

The CER should be updated whenever significant new clinical data becomes available, after PMCF activities, when there are changes to the device or intended purpose, and at minimum during annual reviews as part of post-market surveillance.

What causes CER rejection by Notified Bodies?

Common reasons include inadequate equivalence demonstration, insufficient clinical data for claims, poorly structured SOTA analysis, missing gap analysis, and lack of clear benefit-risk determination. Structure and logical flow are as important as the data itself.

Which MDCG guidance documents are most relevant for clinical evaluation?

Key documents include MDCG 2020-5 (Equivalence), MDCG 2020-6 (Sufficient Clinical Evidence), MDCG 2020-13 (CEAR Template), MDCG 2020-7 (PMCF Plan), and MDCG 2020-8 (PMCF Evaluation Report).

Need Expert Help with Your Clinical Evaluation?

Get personalized guidance on MDR compliance, CER writing, and Notified Body preparation.


References:
– MDR 2017/745 Article 61 (Clinical Evaluation)
– MDR 2017/745 Annex I (General Safety and Performance Requirements)
– MDR 2017/745 Annex XIV (Clinical Evaluation and Post-Market Clinical Follow-Up)
– MDCG 2020-5 (Clinical Evaluation Equivalence)
– MDCG 2020-1 (Clinical Evaluation of Medical Device Software)