Setting Acceptance Criteria for AI Software Performance

Written by Hatem Rabeh, MD, MSc Ing

Your Clinical Evaluation Expert And Partner


Your AI achieves 87% accuracy. Is that acceptable? Without predefined acceptance criteria derived from the state of the art, you cannot answer. You might celebrate results that reviewers reject, or abandon a product that actually exceeds benchmarks. Acceptance criteria are not arbitrary numbers. They are evidence-based thresholds.

Acceptance criteria define the measurable thresholds your AI software must achieve. They appear in your clinical evaluation plan (CEP) and are validated in your clinical evaluation report (CER). Without them, your clinical evaluation has no success criteria. With poorly chosen criteria, you set yourself up for failure or for meaningless success.

What Regulatory Guidance Requires

MDCG 2020-1 requires AI software to demonstrate valid clinical association, analytical performance, and clinical performance. Each pillar needs specific acceptance criteria tailored to device risk and intended population.

MDR Annex I mandates that acceptance and performance criteria are defined up front, including reliability, accuracy, robustness, and safety under normal use and foreseeable misuse. The key phrase is "defined up front." You cannot evaluate evidence without knowing what success looks like.

Key Insight
Acceptance criteria defined without SOTA analysis are arbitrary. Criteria derived from SOTA benchmarks are defensible. Reviewers know the difference immediately.

The Four-Step Method

Step 1: Identify key performance metrics aligned with your intended use. What does success look like for this device? Sensitivity, specificity, positive predictive value, time to result, error rates. List every metric that matters for your clinical claims.

Step 2: Benchmark against state of the art. What do current solutions achieve? What does the literature establish as clinically meaningful? Build a table of current performance levels from published studies and competitor data.

Step 3: Set thresholds based on risk analysis and clinical requirements. Where must you match current standards? Where must you exceed them? Where is lower performance acceptable because you address a different need?

Step 4: Validate that criteria are measurable and achievable. Can you actually measure these metrics with available data? Is there a realistic path to achieving these thresholds?
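The link between Step 2 and Step 3 can be made explicit by keeping the SOTA benchmark and the derived threshold side by side, so every threshold traces back to a published figure plus a documented margin. A minimal sketch, with all numbers hypothetical (the SOTA values and risk-based margins below are illustrative, not drawn from any real benchmark):

```python
# Step 2: hypothetical SOTA benchmarks from literature and competitor data
sota = {"sensitivity": 0.88, "specificity": 0.83, "auroc": 0.89}

# Step 3: acceptance thresholds derived as SOTA plus a risk-based margin,
# documented per metric so the derivation is auditable
risk_margin = {"sensitivity": 0.02, "specificity": 0.02, "auroc": 0.01}
criteria = {metric: round(sota[metric] + risk_margin[metric], 2) for metric in sota}

for metric, threshold in criteria.items():
    print(f"{metric}: threshold {threshold} (SOTA {sota[metric]}, margin {risk_margin[metric]})")
```

Whatever form the table takes, the point is traceability: a reviewer should be able to follow each threshold back to its benchmark and its risk rationale.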

Example Acceptance Criteria

  • Sensitivity target: 90%
  • Specificity target: 85%
  • AUROC target: 0.90
  • Subgroup floor: 85% sensitivity

Example Criteria for AI Diagnostic Software

Strong acceptance criteria look like this:

  • Sensitivity at least 90% with 95% CI lower bound at least 85% on external validation dataset
  • Specificity at least 85% with 95% CI lower bound at least 80% on external validation dataset
  • AUROC at least 0.90 on benchmark data, at least 0.85 on real-world data
  • Performance degradation no more than 5% from clean to degraded input conditions
  • Subgroup floors: sensitivity at least 85% for each predefined demographic group
  • Time to result under 3 seconds per case
  • Serious use error rate below 1% in summative usability testing

Notice how specific each criterion is: not just "high sensitivity" but a number, a confidence interval requirement, and a dataset specification.
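A criterion like "sensitivity at least 90% with 95% CI lower bound at least 85%" is directly checkable. A minimal sketch using the Wilson score interval, with hypothetical validation counts:

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score 95% CI for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom

# Hypothetical external validation: 230 of 250 positive cases detected
tp, positives = 230, 250
sensitivity = tp / positives              # 0.92
lb = wilson_lower_bound(tp, positives)    # ~0.88

meets_criterion = sensitivity >= 0.90 and lb >= 0.85
print(f"sensitivity={sensitivity:.3f}, CI lower bound={lb:.3f}, pass={meets_criterion}")
```

The same point estimate with a smaller sample would fail the lower-bound requirement, which is exactly why the CI belongs in the criterion.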

Common Rejection
Reviewers flag devices with criteria that lack defined thresholds, feature only lab metrics without real-world validation targets, or miss subgroup requirements.

Risk-Based Scaling

Higher risk requires tighter thresholds and more pre-market evidence. A diagnostic device for life-threatening conditions needs stricter sensitivity thresholds than a wellness application. A device used in emergency settings needs faster response time criteria than one used in routine screening.

Document the risk reasoning behind each threshold. Why is 90% sensitivity the right target? Because lower sensitivity in this clinical context could result in missed diagnoses with specific harm potential. This reasoning demonstrates that criteria are derived, not arbitrary.

AI-Specific Criteria

AI software needs additional criteria that traditional devices do not:

  • External validation across multiple independent sites
  • Subgroup performance across demographic and clinical categories
  • Robustness under degraded input conditions
  • Calibration metrics where probability outputs are shown to users
  • Drift monitoring thresholds for post-market surveillance
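Subgroup floors in particular are easy to operationalize as a predefined check. A minimal sketch, with hypothetical subgroup names and sensitivities, checked against an 85% floor:

```python
# Hypothetical per-subgroup sensitivities from external validation,
# compared against the predefined subgroup floor
subgroup_sensitivity = {
    "age_under_40": 0.91,
    "age_40_to_65": 0.89,
    "age_over_65": 0.86,
    "site_community_hospital": 0.87,
}
FLOOR = 0.85

failures = {g: s for g, s in subgroup_sensitivity.items() if s < FLOOR}
print("all subgroups pass" if not failures else f"floor violated: {failures}")
```

The subgroups themselves must be predefined in the CEP; checking only the groups that happen to pass is exactly what reviewers look for.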

In the next post, we cover how to appraise evidence specifically for AI software.

Peace,
Hatem
Your Clinical Evaluation Partner

Frequently Asked Questions

What if my device cannot meet benchmark thresholds?

If your device offers other advantages like speed, cost, or accessibility, criteria may be justifiably lower for some metrics. Document the clinical reasoning. A faster device with slightly lower sensitivity may be appropriate for screening settings.

Should criteria include confidence intervals?

Yes. A criterion of 90% sensitivity is ambiguous. A criterion of 90% sensitivity with 95% CI lower bound of 85% is precise. Include statistical requirements that reflect uncertainty.

Series: AI Medical Device Clinical Evaluation

Part 3 of 6

Coming Soon

Appraising Evidence for AI Medical Software

Need Expert Help with Your Clinical Evaluation?

Get personalized guidance on MDR compliance, CER writing, and Notified Body preparation.


Follow me for more insights and practical advice.

References:
– MDCG 2020-1: Guidance on Clinical Evaluation for Medical Device Software
– Regulation (EU) 2017/745 (MDR), Annex I: General Safety and Performance Requirements
– CORE-MD Framework