Clinical evaluation for SaMD: cover the whole intended purpose
Clinical evaluation under the UK and EU MDR lives or dies on how well you evidence the intended purpose.
A neat way to plan a clinical evaluation strategy is to break the intended purpose into its key components and then choose clinical evaluation activities that, together, cover them all.
Start with the intended purpose
- Patient population: Who is the software for? Age ranges, comorbidities, etc.
- Clinical conditions: Which signs, symptoms, or diseases will the software address?
- Use environment(s): Where will it be used — home, ambulance, primary care, radiology workstation, operating theatre?
- User(s): Who will use it — lay people, nurses, radiographers, physicians, or mixed teams?
Map activities to components
Different activities naturally illuminate different parts of the intended purpose:
- Usability and human-factors testing usually involves the intended users working in the intended-use environments with realistic tasks and use errors
- Retrospective clinical-performance studies — for example, running the algorithm on curated records or images — primarily involve the patient population and the clinical conditions
- Clinical investigations generally put the device in the hands of the intended users and recruit the intended patient population, so that the relevant clinical conditions are identified, managed, or predicted in practice
- The constrained nature of clinical investigations may mean the device is not used in a setting fully reflective of its intended-use environment
- Real-world studies (for example, prospective observational deployments in routine care) can cover all four elements at once
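One way to keep track of this mapping is a simple coverage matrix in the clinical evaluation plan. The sketch below is a minimal illustration in Python, assuming hypothetical activity names and a hand-written mapping to the four components; it is not a prescribed format, just a way to check that the planned portfolio leaves no component unevidenced.

```python
# Minimal sketch of a coverage check for a clinical evaluation plan.
# Activity names and the component mapping are illustrative assumptions.

COMPONENTS = {"patient population", "clinical conditions", "use environments", "users"}

# Hypothetical portfolio: each activity lists the components it credibly evidences.
portfolio = {
    "usability / human factors testing": {"users", "use environments"},
    "retrospective performance study": {"patient population", "clinical conditions"},
    "prospective clinical investigation": {"patient population", "clinical conditions", "users"},
    "real-world deployment study": set(COMPONENTS),  # can cover all four
}

covered = set().union(*portfolio.values())
missing = COMPONENTS - covered
if missing:
    print(f"Coverage gaps: {sorted(missing)}")
else:
    print("All four intended-purpose components are covered.")

# Show which activities evidence each component.
for component in sorted(COMPONENTS):
    activities = [name for name, comps in portfolio.items() if component in comps]
    print(f"{component}: {', '.join(activities)}")
```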
An example portfolio (imaging triage software)
Imagine software that flags suspected pneumothorax on chest X-rays to speed up radiologist review, intended for adult emergency patients in hospital radiology and emergency departments.
- Usability and human factors: Simulated reading sessions with radiologists and emergency physicians using the actual workflow in the picture archiving system and emergency-department viewing stations
- Goal: demonstrate safe, effective interaction, and risk control for use errors relevant to those environments
 
- Retrospective clinical performance: Multi-site dataset with ground-truth pneumothorax diagnoses from blinded consensus covering male and female adults (across all adult age brackets) presenting to emergency departments.
- Endpoints: sensitivity, specificity, and the time-to-notification distribution, with pre-specified subgroup analyses (portable vs fixed X-rays, trauma vs non-trauma). This directly addresses the target patients and condition; a short sketch of these endpoint calculations appears at the end of this example portfolio.
 
- Prospective clinical investigation: Study where readers act on software outputs in real time, but within a controlled research protocol.
- Measures: impact on time to first read, change in reporting accuracy, and workflow effects. This adds evidence about real users acting in practice, though not in unconstrained environments
 
- Real-world deployment study (where feasible pre-market, or planned as post-market clinical follow-up): Limited-scope rollout across emergency and radiology departments with routine staffing, real interruptions, and site variability.
- Collect effectiveness, safety signals, and usability in context — covering all four components together. Note that while real-world evidence may sound like a panacea, it has its own specific limitations to consider, including data representativeness, bias, confounding, and difficulty demonstrating causal effects.
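To make the retrospective endpoints above concrete, here is a minimal sketch assuming a per-case results table with a reference-standard label, the algorithm's flag, a notification time, and an acquisition subgroup. All field names and values are hypothetical and purely illustrative.

```python
# Minimal sketch of retrospective-performance endpoints: sensitivity, specificity,
# the time-to-notification distribution, and a pre-specified subgroup split.
# The per-case fields and values below are hypothetical, not a prescribed data model.

from statistics import median

cases = [
    # ground_truth: reference-standard pneumothorax label; flagged: algorithm output;
    # notify_s: seconds from image availability to notification (flagged cases only).
    {"ground_truth": True,  "flagged": True,  "notify_s": 42.0, "subgroup": "portable"},
    {"ground_truth": True,  "flagged": False, "notify_s": None, "subgroup": "fixed"},
    {"ground_truth": False, "flagged": False, "notify_s": None, "subgroup": "portable"},
    {"ground_truth": False, "flagged": True,  "notify_s": 55.0, "subgroup": "fixed"},
]

def sensitivity_specificity(rows):
    tp = sum(1 for r in rows if r["ground_truth"] and r["flagged"])
    fn = sum(1 for r in rows if r["ground_truth"] and not r["flagged"])
    tn = sum(1 for r in rows if not r["ground_truth"] and not r["flagged"])
    fp = sum(1 for r in rows if not r["ground_truth"] and r["flagged"])
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

sens, spec = sensitivity_specificity(cases)
print(f"Overall: sensitivity={sens:.2f}, specificity={spec:.2f}")

times = [r["notify_s"] for r in cases if r["notify_s"] is not None]
print(f"Median time to notification: {median(times):.1f}s")

# Pre-specified subgroup analysis, e.g. portable vs fixed X-rays.
for subgroup in ("portable", "fixed"):
    rows = [r for r in cases if r["subgroup"] == subgroup]
    s, p = sensitivity_specificity(rows)
    print(f"{subgroup}: sensitivity={s:.2f}, specificity={p:.2f}")
```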
 
Right-sizing for pre-market certification
Regulators expect a clinically meaningful body of evidence that reflects the way the software will actually be used. The most efficient portfolios combine activities so that all four components are covered before certification.
When an activity can credibly cover more than two components at once — such as a well-run real-world study — use it.
Where a prospective trial cannot represent the use environment, balance it with usability work in real settings or with a pragmatic deployment study.
Common pitfalls to avoid
- Over-reliance on curated retrospective data without showing performance on messy, real workflows
- Ignoring the use environment (for example, assuming radiology and emergency department contexts are interchangeable)
- Choosing endpoints that do not match the clinical decision your intended users must make
- Using a study population that is not representative of the target population (e.g. in terms of patient demographics, condition severity, or presentation)
Key takeaways
It's important to plan evidence against patient population, clinical conditions, use environments, and users.
Match activities to components: usability (users & environments), retrospective studies (population & conditions), prospective investigations (population & conditions & users), real-world studies (all four).
For pre-market certification, assemble a portfolio that covers every component; the more components each study addresses, the stronger — and usually more efficient — your case.
