A recipe for good SOUP: Part 3 – Large Language Models as SOUP - Scarlet - Certify your AI medical device without unnecessary delays

29 May, 2025

A recipe for good SOUP: Part 3 – Large Language Models as SOUP

Hann Yee Son

Back

2 Jun, 2026

Building SaMD under the AI Act: a practical guide

22 May, 2026

Not all AI systems are created equal: A guide to AI risk in healthtech regulation

In modern software, leveraging software of unknown provenance (SOUP) presents both benefits and risks. Nowhere is this more pronounced than with large language models (LLMs), which offer huge potential for Software as a Medical Device (SaMD) development.

Capping off our blog series on SOUP, we’ll explore one of the most recent developments that has the potential to transform modern medical devices. Integrating Large Language Models (LLMs) such as ChatGPT, Claude, or Med–PaLM into SaMD offers incredible potential, particularly in clinical decision-making support. However, incorporating LLMs into medical software introduces regulatory challenges, especially in software risk management, verification, and change management.

In this blog post, we’ll explain why LLMs are usually treated as SOUP and discuss their unique challenges compared to more traditional SOUP components.

Are LLMs always SOUP?

In medical device software, LLMs typically share these characteristics:

Developed by third parties
Not developed exclusively for medical device use
Trained on vast, generally undocumented corpora
Opaque APIs, with limited access to internal logic

The first two characteristics alone would qualify a typical LLM as SOUP. However, even an LLM trained in-house by a SaMD manufacturer for the explicit purpose of being integrated into their own SaMD product, would have the following characteristics:

The API is still opaque. Manufacturers generally cannot explain internal logic or retroactively determine how an LLM achieves specific outputs.
The in-house LLM may still leverage third-party components and training data.
The data the model is trained on may not have clear provenance or licensing.

Based on the above, and referencing the requirements of IEC 62304, most instances of LLMs in SaMD would qualify as SOUP.

The risks of incorporating LLMs into SaMD

State-of-the-art LLMs pose a number of regulatory challenges.

No control or transparency on the training data
Models are subject to biases inherent within the training data
Model outputs are non-deterministic
No confidence score is associated with the model outputs
Unpredictable failure modes: e.g. hallucinations
Lack of transparency on the internal logic of the model
Fast-moving technology with regular undisclosed updates

It is essential that the manufacturer comprehensively documents and mitigates risk related to LLMs as part of their risk management activities. A failure to do so may have the following consequences when LLMs are incorporated into a SaMD:

Unsafe recommendations: the LLM may generate inaccurate or unsafe suggestions in a clinical support context
False confidence in output: clinicians may over-trust well-formatted but incorrect output
Inconsistent behaviour: variability across sessions can confuse users or lead to misinterpretation of outputs
Lack of explainability: difficulty justifying recommendations or outputs for clinical audit in retrospect

How to control the risks

To mitigate the LLM-related risks, manufacturers should introduce adequate risk controls, prioritising those that ensure inherently safe software design. These may include:

Summary

LLMs are a new, exciting technology that offers extraordinary potential in the field of SaMD. However, given the opacity of these models, and their generally non-deterministic behaviour, they carry unique challenges that must be accurately defined and mitigated through risk management, inherently safe software design, and rigorous change management.

Need to certify an AI medical device or SaMD?

Get In touch