LLM series — Guiding principles for technical and clinical evaluation of LLMs as a Medical Device
Large language models (LLMs) and other forms of generative AI (GenAI) are rapidly moving into healthcare, and they have enormous potential to change it for the better. This article outlines key approaches to assessing GenAI-based devices, grounded in current best practice in AI evaluation research and in regulatory practice under EU MDR.
When an LLM is central to delivering the claimed clinical benefit — whether by supporting diagnosis, guiding treatment, or influencing patient management — it must be evaluated with the same regulatory rigour as any other medical technology falling under the EU Medical Device Regulation (EU MDR). However, its specific characteristics introduce new challenges that must be addressed during technical and clinical evaluation.
First, let's start at the beginning.
What exactly is an LLM?
An LLM is a type of generative AI, trained on vast amounts of text to predict and generate language.
Unlike traditional models that return a bounded output (a class label, a risk score, or a numeric value), LLMs produce open-ended text (such as sentences, summaries, reports, or recommendations) by sampling from learned probability distributions over language. They are typically accessed through textual inputs, called prompts, which allow users or other systems to interact with them.
Their flexibility makes them powerful tools for natural-language tasks but also introduces risks that are uncommon in conventional AI medical software. When an LLM directly contributes to a clinical function, for example, by generating or influencing content that informs diagnosis or treatment, it falls within the regulatory scope of medical devices and must comply with all applicable EU MDR requirements.
How LLMs differ from “classical” AI models
Traditional machine-learning models in healthcare typically operate with structured inputs and bounded outputs: for example, an image classifier that takes X-ray images of lung lesions as input and outputs “benign” or “malignant.”
The performance of these traditional models can be measured with familiar metrics such as sensitivity, specificity, and AUC (Area Under the ROC Curve) for classification, or root mean square error and 95% limits of agreement for continuous-value prediction.
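For contrast with the generative setting discussed next, here is a minimal sketch of how such classical metrics are computed for a binary classifier. The labels and scores are toy values, and scikit-learn is assumed to be available.

```python
# Minimal illustration of classical metrics for a binary classifier,
# using scikit-learn and a small set of toy labels and scores.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # 1 = malignant, 0 = benign
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])    # model probabilities
y_pred  = (y_score >= 0.5).astype(int)                           # decision threshold at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
auc = roc_auc_score(y_true, y_score)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} AUC={auc:.2f}")
```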
Generative AI differs in several key ways that affect both risk and evaluation strategy:
- Open-ended outputs: the space of possible responses is effectively unbounded, so there is no fixed set of outcomes to validate against
- Stochastic behaviour: the same prompt can produce different outputs on different runs
- Sensitivity to phrasing: small changes in how a prompt is worded can change the response
- Hallucinations: outputs can be fluent and plausible yet factually incorrect or fabricated
- Limited transparency: it is difficult to trace why a particular output was generated
These characteristics make exhaustive validation impossible and challenge the classical notions of reproducibility, traceability, and interpretability that underpin medical device safety.
Defining the evaluation scope
Evaluation should always be tied to the intended purpose of the device. The same LLM may require very different validation approaches depending on its use case:
- AI scribe: factual accuracy and protection against fabricated details
- Decision support tool: clinical appropriateness and avoidance of misleading recommendations
- Diagnostic assistant: agreement of generated outputs with validated clinical reference standards or ground-truth diagnoses
Manufacturers should define and justify:
- Evaluation objectives (e.g. factuality, safety, reproducibility)
- Expected behaviours and error types (e.g. omission vs. fabrication)
- Acceptability thresholds linked to identified risks and desirable technical and clinical performance aspects
They should do so using scientifically documented and validated frameworks, identified for example through a scientific literature search. This clarity ensures that both pre-market and post-market evaluation activities are aligned with the device’s purpose and risk profile.
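One illustrative way to record these definitions is a structured specification that pre-market and post-market evaluation plans can both reference. Every objective, metric, and threshold below is hypothetical and would need to be justified against the device's own intended purpose and risk analysis.

```python
# Hypothetical acceptance-criteria specification: all values are illustrative,
# not recommended limits, and must be justified by the device's risk analysis.
ACCEPTANCE_CRITERIA = {
    "factuality": {
        "error_types": ["fabrication", "omission"],
        "metric": "rate of major hallucinations per generated document",
        "threshold": "below a risk-justified limit (e.g. <1%)",
        "evaluation_method": "expert review supported by automated fact checking",
    },
    "reproducibility": {
        "error_types": ["output drift across repeated generations"],
        "metric": "semantic agreement across 10 repeated generations",
        "threshold": ">= 95% agreement",
        "evaluation_method": "automated consistency testing",
    },
}
```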
Technical evaluation of GenAI: frameworks and metrics
LLM evaluation cannot rely solely on classical accuracy measures.
Considering the broad range of technical aspects to evaluate and associated challenges, multiple evaluation frameworks will likely need to be used to assess the technical performance of GenAI devices. These frameworks can be automated, manual, or a hybrid of the two.
Automatic evaluation
Automated frameworks aim to provide scalable, reproducible assessments. Examples include:
- Factuality and entailment scoring: frameworks such as QuestEval (see reference 1 below) or FActScore (see reference 2 below) test generated content against trusted references, for example by decomposing outputs into atomic facts and verifying each one
- Uncertainty estimation: analysing semantic entropy (see reference 3 below) can flag high-uncertainty responses that are statistically more prone to hallucination (a minimal sketch follows this list)
- Smoothness testing: assessing how small input perturbations (e.g. paraphrasing a prompt) affect output consistency helps reveal instability
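To make the uncertainty-estimation idea concrete, here is a minimal sketch of semantic-entropy-style flagging. It is not the published implementation: `semantically_equivalent` is a placeholder for the bidirectional-entailment check used in reference 3, and the flagging threshold is purely illustrative.

```python
# Minimal sketch in the spirit of semantic entropy (reference 3): sample several
# answers to the same prompt, group them into meaning-equivalent clusters, and
# compute the entropy over those clusters. `semantically_equivalent` stands in
# for an entailment check performed with a natural language inference model.
import math

def cluster_by_meaning(answers, semantically_equivalent):
    clusters = []  # each cluster holds answers judged to share one meaning
    for ans in answers:
        for cluster in clusters:
            if semantically_equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def semantic_entropy(answers, semantically_equivalent):
    clusters = cluster_by_meaning(answers, semantically_equivalent)
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Toy usage: exact string match stands in for the entailment check.
samples = ["Start amoxicillin", "Start amoxicillin",
           "Start amoxicillin", "No antibiotic needed"]
entropy = semantic_entropy(samples, lambda a, b: a == b)
if entropy > 0.5:  # illustrative flagging threshold, not a validated value
    print(f"High uncertainty (entropy={entropy:.2f}): route to human review")
```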
Manual evaluation
Human expert judgement often remains indispensable, especially for assessing clinical appropriateness:
- Clinical-appropriateness review: subject-matter experts assess factual accuracy and clinical safety in context
- Error categorisation: classifying hallucinations as major (potentially altering care) or minor (stylistic or low-impact) enables a risk-based perspective (see reference 4 below)
- Comparison to human baselines: evaluating how model outputs compare to clinician-generated equivalents
Hybrid approaches
- Automated triage with human adjudication: scalable automation with expert review for flagged cases (a minimal sketch follows this list)
- Continuous post-market monitoring: feedback from clinicians and real-world use helps identify new failure modes over time
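The triage pattern can be sketched as below. The scoring functions and thresholds are placeholders for the automated metrics discussed above, not a prescribed implementation.

```python
# Minimal sketch of automated triage with human adjudication: automated checks
# score every generated output, and only cases falling outside (hypothetical)
# acceptance thresholds are queued for expert review.
from dataclasses import dataclass

@dataclass
class TriageResult:
    case_id: str
    factuality_score: float   # e.g. fraction of verified atomic facts
    uncertainty: float        # e.g. semantic entropy over repeated samples
    needs_human_review: bool

def triage(case_id, output, factuality_fn, uncertainty_fn,
           min_factuality=0.95, max_uncertainty=0.5):  # illustrative thresholds
    factuality = factuality_fn(output)
    uncertainty = uncertainty_fn(output)
    flagged = factuality < min_factuality or uncertainty > max_uncertainty
    return TriageResult(case_id, factuality, uncertainty, flagged)

# Stub scorers stand in for real automated evaluators.
result = triage("case-001", "generated discharge summary ...",
                factuality_fn=lambda text: 0.90,
                uncertainty_fn=lambda text: 0.20)
print(result.needs_human_review)  # True here, so the case goes to a clinician
```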
The overarching principle: evaluation must be risk driven, context specific, and justified by the manufacturer.
Linking technical evaluation to clinical evidence
Strong technical metrics do not automatically translate into clinical benefit. Manufacturers must demonstrate how improved model performance leads to measurable outcomes for patients or clinicians.
- Technical performance assesses plausibility, fidelity, and stability of generated outputs
- Clinical performance demonstrates tangible benefits such as improved diagnostic accuracy, reduced errors, or workflow efficiency beneficial to patients
Suitable study designs may include comparative studies, reader studies with clinician panels, or real-world evaluations. Technical metrics and clinical endpoints should be treated as complementary, not interchangeable.
Tackling risk management
Risk management for GenAI should explicitly address the unique hazards and failure modes of generative systems, including for example:
- Hallucinations or omissions
- Over-reliance or automation bias
- Cognitive overload from verbose or inconsistent outputs
- Contextual mix-ups (e.g. referencing the wrong patient)
- Specific data-privacy risks
- Prompt injection or manipulation
Evidence from technical and clinical evaluation should feed directly into ISO 14971-aligned processes (a simplified sketch follows this list):
- Identify hazards
- Characterise hazardous situations
- Estimate and evaluate the risk
- Provide evidence that the technical and procedural controls reduce risk to acceptable levels
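One simplified way to capture this linkage is a risk-register entry per hazard, where evaluation evidence justifies the residual-risk estimate. The hazard, scales, estimates, and controls below are hypothetical examples, not recommended values.

```python
# Illustrative sketch of how evaluation evidence might feed an ISO 14971-aligned
# risk record; all values are hypothetical.
from dataclasses import dataclass

@dataclass
class RiskRecord:
    hazard: str                # e.g. fabricated clinical detail
    hazardous_situation: str   # how the hazard could reach the patient
    severity: int              # 1 (negligible) .. 5 (catastrophic)
    probability: int           # 1 (improbable) .. 5 (frequent), before controls
    controls: list             # technical and procedural risk controls
    residual_probability: int  # probability after controls, from evaluation evidence

    def risk_score(self, after_controls=True):
        p = self.residual_probability if after_controls else self.probability
        return self.severity * p

record = RiskRecord(
    hazard="Hallucinated medication dose in a generated summary",
    hazardous_situation="Clinician transcribes the fabricated dose into an order",
    severity=5,
    probability=3,
    controls=["automated fact checking", "mandatory clinician sign-off"],
    residual_probability=1,    # supported by technical and clinical evaluation data
)
print(record.risk_score(after_controls=False), "->", record.risk_score())  # 15 -> 5
```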
This ensures that evaluation activities serve their ultimate purpose: safeguarding patients and supporting responsible innovation.
Conclusion
LLMs promise to transform healthcare by enabling more natural human–machine interaction and supporting clinical decision-making in unprecedented ways. Yet their generative, stochastic, and open-ended nature demands evaluation frameworks that go beyond traditional metrics.
For RAQA professionals, the key is to ensure that these technologies are integrated into clinical practice with transparency, traceability, and safety — anchoring innovation within a robust regulatory framework. Then the power of these generative-AI models can be properly realised.
References:
- Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestEval: Summarization Asks for Fact-based Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.
- Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0
- Asgari, E., Montaña-Brown, N., Dubois, M. et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 8, 274 (2025). https://doi.org/10.1038/s41746-025-01670-7
