Setting Acceptance Criteria for AI Software Performance

Written by Hatem Rabeh, MD, MSc Ing

Your Clinical Evaluation Expert And Partner


Your AI achieves 87% accuracy. Is that acceptable? Without predefined acceptance criteria derived from the state of the art, you cannot answer. You might celebrate results that reviewers reject, or abandon a product that actually exceeds benchmarks. Acceptance criteria are not arbitrary numbers. They are evidence-based thresholds.

Acceptance criteria define the measurable thresholds your AI software must achieve. They appear in your clinical evaluation plan (CEP) and are validated in your clinical evaluation report (CER). Without them, your clinical evaluation has no success criteria. With poorly chosen criteria, you set yourself up for failure or for meaningless success.

What Regulatory Guidance Requires

MDCG 2020-1 requires AI software to demonstrate valid clinical association, analytical performance, and clinical performance. Each pillar needs specific acceptance criteria tailored to device risk and intended population.

MDR Annex I mandates that acceptance and performance criteria are defined up front, including reliability, accuracy, robustness, and safety under normal use and foreseeable misuse. The key phrase is "defined up front." You cannot evaluate evidence without knowing what success looks like.

Key Insight
Acceptance criteria defined without SOTA analysis are arbitrary. Criteria derived from SOTA benchmarks are defensible. Reviewers know the difference immediately.

The Four-Step Method

Step 1: Identify key performance metrics aligned with your intended use. What does success look like for this device? Sensitivity, specificity, positive predictive value, time to result, error rates. List every metric that matters for your clinical claims.

Step 2: Benchmark against state of the art. What do current solutions achieve? What does the literature establish as clinically meaningful? Build a table of current performance levels from published studies and competitor data.

Step 3: Set thresholds based on risk analysis and clinical requirements. Where must you match current standards? Where must you exceed them? Where is lower performance acceptable because you address a different need?

Step 4: Validate that criteria are measurable and achievable. Can you actually measure these metrics with available data? Is there a realistic path to achieving these thresholds?
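The link between Step 2 and Step 3 can be made explicit by keeping the SOTA benchmark and the derived threshold side by side, so every threshold traces back to a published figure plus a documented margin. A minimal sketch, with all numbers hypothetical (the SOTA values and risk-based margins below are illustrative, not drawn from any real benchmark):

```python
# Step 2: hypothetical SOTA benchmarks from literature and competitor data
sota = {"sensitivity": 0.88, "specificity": 0.83, "auroc": 0.89}

# Step 3: acceptance thresholds derived as SOTA plus a risk-based margin,
# documented per metric so the derivation is auditable
risk_margin = {"sensitivity": 0.02, "specificity": 0.02, "auroc": 0.01}
criteria = {metric: round(sota[metric] + risk_margin[metric], 2) for metric in sota}

for metric, threshold in criteria.items():
    print(f"{metric}: threshold {threshold} (SOTA {sota[metric]}, margin {risk_margin[metric]})")
```

Whatever form the table takes, the point is traceability: a reviewer should be able to follow each threshold back to its benchmark and its risk rationale.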

Example Acceptance Criteria

  • Sensitivity target: 90%
  • Specificity target: 85%
  • AUROC target: 0.90
  • Subgroup floor: 85% sensitivity

Example Criteria for AI Diagnostic Software

Strong acceptance criteria look like this:

  • Sensitivity at least 90% with 95% CI lower bound at least 85% on external validation dataset
  • Specificity at least 85% with 95% CI lower bound at least 80% on external validation dataset
  • AUROC at least 0.90 on benchmark data, at least 0.85 on real-world data
  • Performance degradation no more than 5% from clean to degraded input conditions
  • Subgroup floors: sensitivity at least 85% for each predefined demographic group
  • Time to result under 3 seconds per case
  • Serious use error rate below 1% in summative usability testing

Notice how specific each criterion is: not just "high sensitivity" but a number, a confidence interval requirement, and a dataset specification.
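A criterion like "sensitivity at least 90% with 95% CI lower bound at least 85%" is directly checkable. A minimal sketch using the Wilson score interval, with hypothetical validation counts:

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score 95% CI for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom

# Hypothetical external validation: 230 of 250 positive cases detected
tp, positives = 230, 250
sensitivity = tp / positives              # 0.92
lb = wilson_lower_bound(tp, positives)    # ~0.88

meets_criterion = sensitivity >= 0.90 and lb >= 0.85
print(f"sensitivity={sensitivity:.3f}, CI lower bound={lb:.3f}, pass={meets_criterion}")
```

The same point estimate with a smaller sample would fail the lower-bound requirement, which is exactly why the CI belongs in the criterion.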

Common Rejection
Reviewers flag devices with criteria that lack defined thresholds, feature only lab metrics without real-world validation targets, or miss subgroup requirements.

Risk-Based Scaling

Higher risk requires tighter thresholds and more pre-market evidence. A diagnostic device for life-threatening conditions needs stricter sensitivity thresholds than a wellness application. A device used in emergency settings needs faster response time criteria than one used in routine screening.

Document the risk reasoning behind each threshold. Why is 90% sensitivity the right target? Because lower sensitivity in this clinical context could result in missed diagnoses with specific harm potential. This reasoning demonstrates that criteria are derived, not arbitrary.

AI-Specific Criteria

AI software needs additional criteria that traditional devices do not:

  • External validation across multiple independent sites
  • Subgroup performance across demographic and clinical categories
  • Robustness under degraded input conditions
  • Calibration metrics where probability outputs are shown to users
  • Drift monitoring thresholds for post-market surveillance
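Subgroup floors in particular are easy to operationalize as a predefined check. A minimal sketch, with hypothetical subgroup names and sensitivities, checked against an 85% floor:

```python
# Hypothetical per-subgroup sensitivities from external validation,
# compared against the predefined subgroup floor
subgroup_sensitivity = {
    "age_under_40": 0.91,
    "age_40_to_65": 0.89,
    "age_over_65": 0.86,
    "site_community_hospital": 0.87,
}
FLOOR = 0.85

failures = {g: s for g, s in subgroup_sensitivity.items() if s < FLOOR}
print("all subgroups pass" if not failures else f"floor violated: {failures}")
```

The subgroups themselves must be predefined in the CEP; checking only the groups that happen to pass is exactly what reviewers look for.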

In the next post, we cover how to appraise evidence specifically for AI software.

Peace,
Hatem
Your Clinical Evaluation Partner

Frequently Asked Questions

What if my device cannot meet benchmark thresholds?

If your device offers other advantages like speed, cost, or accessibility, criteria may be justifiably lower for some metrics. Document the clinical reasoning. A faster device with slightly lower sensitivity may be appropriate for screening settings.

Should criteria include confidence intervals?

Yes. A criterion of 90% sensitivity is ambiguous. A criterion of 90% sensitivity with 95% CI lower bound of 85% is precise. Include statistical requirements that reflect uncertainty.

Series: AI Medical Device Clinical Evaluation

Part 3 of 6

Coming Soon

Appraising Evidence for AI Medical Software

Need Expert Help with Your Clinical Evaluation?

Get personalized guidance on MDR compliance, CER writing, and Notified Body preparation.


Follow me for more insights and practical advice.

References:
– MDCG 2020-1: Guidance on Clinical Evaluation for Medical Device Software
– Regulation (EU) 2017/745 (MDR), Annex I: General Safety and Performance Requirements
– CORE-MD Framework