LLM series — Can your medical device use ChatGPT and still get certified under EU MDR? The case for third-party LLMs in AIaMD
The short answer: yes. It is not easy, but it is possible.
Large Language Models (LLMs) have taken the world by storm. Companies in every industry are investigating whether this powerful technology can enhance efficiency and bring new capabilities to bear in their respective fields.
The world of medical devices is no different. Many innovators are prototyping AI as a Medical Device (AIaMD) products on top of the readily available third-party, commercial LLMs by OpenAI, Anthropic, Google, and others.
Right now, these innovators are asking regulatory consultants and Notified Bodies the big question: Is this a viable strategy? Can I use a third-party, commercial LLM in my AIaMD and get it certified?
There is considerable disagreement among medical-device-regulation experts on this topic. Many are rightfully cautious. They highlight the risks inherent to LLMs, and the additional risks that arise when control of the technology rests with a third party. For many experts, the answer is: “No. Right now, it is not safe.”
A workable approach
As a Notified Body, Scarlet aims to facilitate and empower innovators, while also holding them to account for evidencing the safety and performance of the devices they produce. As such, Scarlet’s position is that we will consider submissions that utilise third-party LLMs.
However, the burden lies with the manufacturer to ensure that they have comprehensively identified the risks associated with this approach, implemented effective risk-control measures, and developed robust post-market monitoring, change-management, and supplier-management processes to ensure that the device remains safe and effective throughout its lifetime.
In practice, a viable strategy typically requires three things:
- comprehensive identification of the risks introduced by depending on a third-party LLM
- effective risk-control measures for those risks
- robust post-market monitoring, change-management, and supplier-management processes that keep the device safe and effective throughout its lifetime

So what are the risks? How can they be mitigated?
Below are the commonly cited risks of using third-party, commercial LLMs in AIaMD, along with practical approaches to mitigation, monitoring, and ongoing management.
Model version control
What is the risk?
The LLM landscape is highly competitive and rapidly evolving. Major vendors frequently revise their models, and older models can be deprecated at short notice.
Because LLMs are non-deterministic and not fully transparent, it is difficult to credibly argue that iterative revisions will not impact the safety and performance of an AIaMD that depends on them.
If a vendor withdraws the model your device relies on, you may face availability risk, forced migration, or loss of validated performance.
How can you mitigate it?
- Pin to a specific model snapshot, not a floating alias. For example, pin to a dated snapshot (e.g. gpt-5.1-2025-11-13) rather than “latest” (e.g. gpt-5.1-latest). This reduces the risk of uncontrolled performance change; a minimal sketch follows this list
- Develop an efficient change control process for retraining and revalidating on newer model versions as they are released. Proactive migration reduces the risk of being forced into urgent change when obsolescence is unexpectedly announced
- Consider open-source models, where you have better control over versioning and more ability to manage obsolescence
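To make the first point concrete, here is a minimal sketch of pinning a dated snapshot in a single place rather than scattering a floating alias through the codebase. It assumes an OpenAI-style chat completions client; the snapshot identifier is the illustrative one from the example above, and the `generate` wrapper is hypothetical.

```python
from openai import OpenAI

# Single source of truth for the model version the device was verified and
# validated against. Never point this at a floating alias such as "-latest".
PINNED_MODEL_SNAPSHOT = "gpt-5.1-2025-11-13"  # illustrative snapshot from the example above

client = OpenAI()

def generate(prompt: str) -> str:
    """Call the LLM with the pinned snapshot so every output is traceable to the
    exact model version covered by verification and validation."""
    response = client.chat.completions.create(
        model=PINNED_MODEL_SNAPSHOT,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Recording the pinned snapshot in release documentation and audit logs also makes it straightforward to show which model version produced any given output.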
Real-world behaviour changes
What is the risk?
When you move between versions of a commercial LLM, its performance characteristics are highly likely to change due to differences in training data and model design.
This means that even if you demonstrate safety and performance during design and development, there is a legitimate risk that your device’s real-world behaviour changes after a supplier update or migration event.
How can you mitigate it?
- Treat model upgrades like safety-relevant changes. Use formal change management and clearly define what constitutes a “major” versus “minor” model change for your device
- Use a clinically grounded regression harness to compare behaviour across versions (a minimal sketch follows this list)
- Define strict acceptance criteria and escalation triggers (including when to block deployment)
- Monitor in production for leading indicators (e.g. increased uncertainty flags, increased clinician-override rates, higher downstream correction rates), and feed these into CAPA plans where appropriate
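A regression harness of the kind described above can be quite small. The sketch below is illustrative only: it assumes a clinically adjudicated case set and a `classify` wrapper around your LLM call (both hypothetical), and the acceptance margin should come from your own risk analysis and change-management procedure.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class RegressionCase:
    prompt: str      # input representative of the intended use
    reference: str   # clinically adjudicated expected output

def agreement_rate(cases: Sequence[RegressionCase],
                   classify: Callable[[str, str], str],
                   model_version: str) -> float:
    """Fraction of cases where the model output matches the clinical reference."""
    hits = sum(classify(case.prompt, model_version) == case.reference for case in cases)
    return hits / len(cases)

def gate_upgrade(cases: Sequence[RegressionCase],
                 classify: Callable[[str, str], str],
                 validated_version: str,
                 candidate_version: str,
                 max_drop: float = 0.02) -> float:
    """Block deployment if the candidate model version regresses beyond the agreed margin."""
    baseline = agreement_rate(cases, classify, validated_version)
    candidate = agreement_rate(cases, classify, candidate_version)
    if candidate < baseline - max_drop:
        raise RuntimeError(
            f"Regression detected: {candidate:.3f} vs validated baseline {baseline:.3f}; "
            "escalate via the change-management procedure"
        )
    return candidate
```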
Availability
What is the risk?
All software systems are subject to downtime and runtime performance issues. LLMs are no exception.
When relying on a third-party LLM, it is vital to consider the clinical risk that can arise from outage periods, degraded latency, rate limiting, or partial failures.
How can you mitigate it?
- Select an enterprise deployment that provides availability and performance guarantees aligned with your risk profile. Examples of managed enterprise cloud deployments include AWS Bedrock, Azure OpenAI Service, and Google Cloud Vertex AI
- Monitor availability and performance metrics and compare them against the assumptions you made during verification and validation. Generate and review incident reports if levels fall below expected standards
- Design for safe degradation, as sketched below. Provide clear status indicators and warnings to users, and consider user pathways to alternative care processes when the system is unavailable or delayed
- Evaluate the clinical impact of downtime/latency as part of risk management. If the residual risk is unacceptable, implement additional controls, or limit intended use accordingly
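To make “safe degradation” concrete, the sketch below wraps the LLM call in a latency budget with a clearly signalled fallback. The `call_llm` wrapper and the exceptions it raises are assumptions; your own client may surface timeouts and outages differently.

```python
import logging
from typing import Callable

logger = logging.getLogger("aiamd.llm")

class LLMUnavailable(Exception):
    """Raised when the LLM cannot respond within the clinical latency budget."""

def generate_with_fallback(prompt: str,
                           call_llm: Callable[..., str],
                           timeout_s: float = 10.0) -> str:
    """Call the third-party LLM, degrading safely if it is slow or unavailable.

    `call_llm` is a thin wrapper around the vendor SDK that is assumed to raise
    TimeoutError or ConnectionError on failure.
    """
    try:
        return call_llm(prompt, timeout=timeout_s)
    except (TimeoutError, ConnectionError) as exc:
        # Record the event for availability monitoring and incident reporting,
        # then surface a clear status so the user can switch to the alternative
        # care pathway defined in the risk management file.
        logger.error("LLM call failed or exceeded %.1fs budget: %s", timeout_s, exc)
        raise LLMUnavailable(
            "AI assistance is temporarily unavailable; follow the standard workflow"
        ) from exc
```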
Data protection & confidentiality
What is the risk?
If prompts, context windows, or logs include patient or user data, sending them to a third-party service can create confidentiality, retention, residency, and access-control risks. These risks can be clinical (loss of trust, workflow disruption) and regulatory (privacy and security obligations).
How can you mitigate it?
- Minimise and de-identify data shared with the model wherever feasible (see the sketch after this list)
- Use contractual and technical controls around:
- data retention and deletion
- whether data is used for training
- residency/location requirements
- access controls and auditability
- Design logging carefully. Ensure audit logs support safety and traceability while still respecting data minimisation principles
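As a rough illustration of data minimisation, the sketch below masks a few obvious direct identifiers before text leaves the device. The patterns are deliberately simplistic and purely illustrative; real de-identification requires validated tooling and should be assessed as part of your data-protection and security work.

```python
import re

# Deliberately simplistic, illustrative patterns only; production de-identification
# requires validated tooling and a documented data-protection assessment.
IDENTIFIER_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\+?\d[\d ()-]{8,}\d\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def minimise(text: str) -> str:
    """Mask obvious direct identifiers before the prompt is sent to the third party."""
    for label, pattern in IDENTIFIER_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```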
Hallucinations & randomness
What is the risk?
Hallucinations and randomness are challenges for many machine-learning systems, including LLMs. These systems are inherently complex and can behave like “black boxes” from the developer’s perspective. This can make erroneous outputs difficult to triage, root cause, and resolve.
There are specific challenges when working with third-party LLMs. For instance:
- You have no access to training data or architecture details, and may not be able to tune model weights
- You cannot audit internal decision-making
- You are limited to whatever the provider chooses to disclose about the model and its outputs
How can you mitigate it?
There is no silver-bullet solution, but there are steps manufacturers can take to improve repeatability and stability of the output:
- Use structured prompts that produce checkable outputs (see the sketch after this list), such as:
- short rationales for the output (surfacing hidden reasoning)
- explicit assumptions and uncertainties
- structured fields (e.g. JSON output constraints)
- source attribution when using provided reference material
- Implement audit logs capturing the input prompt, supplied context, model version, output, and any discrete determinations made (including uncertainty flags)
- Enforce input validation, restricting inputs to verified formats and ranges and rejecting everything else
- Add output validation and constraint checks, especially around safety-critical statements, uncertainty thresholds, and prohibited content
- Utilise the API’s temperature parameter. Most commercial LLMs expose this parameter to let you control the “randomness” of text generation. Lower temperatures produce more focused, repeatable outputs, but do not eradicate hallucinations
- Leverage published documentation and external evidence about the model’s known limitations and behaviours, and incorporate these into risk analysis, testing, and user-facing labelling/instructions
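Several of the points above (structured outputs, output validation, and audit logging) can be combined in one small wrapper. The sketch below assumes the prompt instructs the model to return JSON with finding, rationale, and uncertainty fields; the field names and the uncertainty threshold are illustrative, not a prescribed schema.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("aiamd.audit")

REQUIRED_FIELDS = {"finding", "rationale", "uncertainty"}  # illustrative output schema
UNCERTAINTY_THRESHOLD = 0.3                                # illustrative escalation trigger

def validate_output(raw: str) -> dict:
    """Parse and constrain the model's JSON output; reject anything non-conforming."""
    data = json.loads(raw)  # raises a ValueError subclass if the JSON is malformed
    if not isinstance(data, dict):
        raise ValueError("Model output is not a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Model output missing required fields: {sorted(missing)}")
    uncertainty = float(data["uncertainty"])
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("Uncertainty score out of range")
    data["needs_review"] = uncertainty >= UNCERTAINTY_THRESHOLD
    return data

def write_audit_record(prompt: str, context: str, model_version: str, output: dict) -> None:
    """Capture inputs, model version, and the validated output for traceability."""
    audit_logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt": prompt,
        "context": context,
        "output": output,
    }))
```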
Training data
What is the risk?
When choosing third-party, commercial LLMs, you relinquish control over training data and training processes. Many commercial LLMs are trained on massive, diverse datasets without a focus on your specific medical domain, intended use, or target population. This can elevate risks of harmful bias, irrelevant outputs, or clinically inappropriate behaviours.
How can you mitigate it?
- Perform robust technical and clinical evaluation of performance during design and development, focused on the intended use and target population(s); a minimal evaluation sketch follows this list. This is how you demonstrate that the LLM produces accurate and stable outputs that deliver tangible clinical benefit. Further information on this topic is covered in our blog post about the key guiding principles for technical and clinical evaluation of LLMs as a Medical Device.
- Use output validation and constraint checks to flag uncertain responses and enforce safe output formats
- Continue evaluation through post-market activities (including surveillance for newly discovered failure modes) and maintain a pathway for corrective action
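For a device with a binary output, the evaluation described in the first point can reduce to something as simple as the sketch below, run against a clinically adjudicated dataset representative of the intended use and target population. The case structure and `classify` wrapper are assumptions; the metrics and acceptance criteria should come from your clinical evaluation plan.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass(frozen=True)
class LabelledCase:
    prompt: str   # representative input
    label: str    # clinically adjudicated reference: "positive" or "negative"

def sensitivity_specificity(cases: Sequence[LabelledCase],
                            classify: Callable[[str], str]) -> Tuple[float, float]:
    """Compare model outputs against clinical reference labels."""
    tp = fp = tn = fn = 0
    for case in cases:
        predicted = classify(case.prompt)  # wrapper around the LLM call
        if case.label == "positive":
            if predicted == "positive":
                tp += 1
            else:
                fn += 1
        else:
            if predicted == "negative":
                tn += 1
            else:
                fp += 1
    return tp / (tp + fn), tn / (tn + fp)
```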
Plan for the worst
Above, we have identified several ways to handle version changes proactively, continually monitor the performance of your device, and detect and report degradation in production.
However, this alone is not enough.
Manufacturers must also plan for worst-case scenarios where the device experiences serious production incidents and requires intervention. Manufacturers should define and maintain processes for rollback and recall to protect users from harm. Where regression failures and/or incident reports indicate severe issues, manufacturers must be able to roll back to a known, safe, and performant configuration swiftly or recall the device from use until safety and performance can be restored.
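One way to keep rollback fast is to treat the deployed LLM configuration (pinned model snapshot, prompt/template version, decoding parameters) as a versioned artefact, with the last configuration that passed verification and validation retained as the rollback target. The sketch below is illustrative; the configuration fields and version strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    model_snapshot: str   # pinned vendor model version
    prompt_version: str   # versioned prompt/template set
    temperature: float    # decoding parameters used during verification and validation

# Last configuration that passed verification and validation (values illustrative).
LAST_KNOWN_GOOD = LLMConfig(
    model_snapshot="gpt-5.1-2025-11-13",
    prompt_version="prompts-v3",
    temperature=0.0,
)

def roll_back() -> LLMConfig:
    """Revert the deployment to the last configuration that passed V&V.

    Triggered by change control or incident handling when regression failures or
    post-market signals exceed predefined escalation thresholds. If no validated
    configuration remains available (for example, the snapshot itself has been
    withdrawn), the recall pathway applies instead.
    """
    return LAST_KNOWN_GOOD
```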
Conclusions
LLMs are a new and exciting technology with the potential to have a significant impact globally, including on the medical-device industry.
However, benefits must be weighed against risks. Using a third-party, commercial LLM can be a viable strategy. Still, the decision should not be taken lightly: the manufacturer assumes responsibility for risks introduced by dependency on a rapidly evolving external model.
Scarlet does not restrict manufacturers from attempting to leverage these LLMs within their AIaMD. However, when using third-party LLMs as SOUP, Scarlet expects comprehensive identification, evaluation, and control of the risks stemming from this design choice.
In particular, strong post-market monitoring and change management are crucial to demonstrate that the device can remain safe and effective throughout its lifespan, even as the underlying LLM landscape evolves.
Manufacturers who can manage the above should find that certifying a medical device harnessing the power of those third-party LLMs, like ChatGPT, is possible.
