A Clinical Maturity Lens read of Babylon Health

In August 2019, Dr David Watkins, an NHS consultant oncologist, ran the same case through Babylon Health's symptom-checker twice. A 59-year-old smoker, sudden onset chest pain, nausea. First as a man. Then as a woman. The chatbot routed the man to A&E with information about a possible heart attack. It routed the woman to her GP within six hours with information about a panic attack (TechCrunch, February 2020).

The most common reading of this incident is gender bias in the training data. That reading is true but incomplete. The deeper failure was structural, and it lived several layers above the model. It happened in how Babylon defined what triage was supposed to do.

This is a read of Babylon through the Clinical Maturity Lens. Tessa's failure was about a product carrying on with the wrong patient. Babylon's failure was earlier than that: the product never had a sufficient model of the patient in the first place.

What Babylon was designed to do

Babylon Health was founded in 2013 by Ali Parsa as an AI-powered triage and consultation service. By 2019 the company had a $2 billion valuation, partnerships with the NHS through GP at Hand, a 10-year contract with the Royal Wolverhampton NHS Trust covering 300,000 people, and a 10-year agreement with the government of Rwanda intended to digitise primary triage for the country's population (TechCrunch, March 2021). At a London event in 2018, the product was promoted as on par with top-rated practising clinicians (TechCrunch, February 2020). Independent reporting at the time questioned whether the underlying evidence stood up to the marketing (Undark, December 2019).

The product's documented intended use was narrower than the marketing. Babylon's terms of service stated the chatbot did not provide a medical diagnosis and was not a substitute for a doctor. Its in-app disclaimer told the patient that the service "does not diagnose your own health condition or make treatment recommendations for you" (TechCrunch, February 2020).

In practice, the patient on the other end of the chatbot was given suggested diagnoses and advised whether to seek emergency care, see a GP, or manage the problem with self-care. Patients frequently treated this as triage advice. That is what triage chatbots are for. The gap between documented intent and in-product behaviour was the same gap Tessa had: clinical scope on paper, undefined in the product.

What happened in deployment

Watkins began raising safety concerns with UK regulators in February 2017. By May 2018, the MHRA had independently notified Babylon of two safety incidents involving the chatbot, one missed heart attack and one missed deep vein thrombosis (TechCrunch, March 2021). The Health Service Journal reported the safety investigation that summer (HSJ, June 2018). Watkins continued documenting failures publicly on Twitter under the handle @DrMurphy11, including the August 2019 gender-bifurcation case and a separate cardiac case the following month.

In February 2020, Babylon responded to Watkins with a press release that referred to him as a "troll" and used his personal app-usage data to dispute his findings. He had unmasked himself the day before at a Royal Society of Medicine event (TechCrunch, February 2020).

On 4 December 2020 the MHRA wrote to Watkins. The letter said: "Your concerns are all valid and ones that we share" (TechCrunch, March 2021).

Babylon's approach to the safety issues, by Watkins's account, was piecemeal:

"They never spent time addressing the broader fundamental issues within the system. Hence, safety issues would repeatedly crop up." (TechCrunch, March 2021)

The eventual remediation pattern was to route every chest-pain presentation to A&E by default. The triage step had been replaced by a kill switch.

In May 2023 Babylon announced it would go private through a deal with MindMaze. In August 2023 the deal collapsed, the company filed for Chapter 7 bankruptcy in the United States, and the UK arm was sold to eMed Healthcare for £500,000 (TechCrunch, August 2023). The Rwanda business, which had been intended to triage primary care for 2.8 million people, was wound down.

Reading Babylon through the Lens

Foundation: Assessment is the product

The Lens begins with three foundation questions. Can this person safely benefit, or should they be redirected before they start (Assessment)? Should this person continue, and is the product itself behaving safely (Screening)? Is risk being managed continuously, at user and product level?

For a triage chatbot, Assessment is not a gate. It is the whole product. When the entire premise is "we will route you to the right level of care", there is no Assessment step before the product begins. The Assessment is the product.

This places the foundation question differently than it sat for Tessa. Tessa's Assessment failure was that the wrong population arrived at the door of a tool that wasn't designed for them. Babylon's Assessment failure was that the door itself routed half the population to the wrong place.

When Assessment is the product, the bar for it is whether the product can reliably perform the Assessment across the full range of patients who present, including the ones whose symptoms are atypical for the most common script. Babylon's reliability gap on that bar was documented publicly for three years and acknowledged privately by the regulator a year after that.

Dimension 1: Variability tolerance

How does your product respond to the user who pushes through pain, the one who stops at the first discomfort, and the one who tells you what they think you want to hear?

The 59-year-old smoker with chest pain and nausea is not an edge case. Hers is a textbook clinical presentation of myocardial infarction in a female patient. Women presenting with chest pain plus nausea are part of the foundational training of any clinician working in emergency or primary care. The "atypical" presentation is atypical only in the sense that it is not the one most often shown on television. In the clinic, it is routine.

When Babylon's triage routed her to a panic attack diagnosis and routed her identical male twin to A&E, the system was doing what its design told it to do. The product had been built around a representative patient. Normal variability was being treated as anomaly. That is a Dimension 1 failure at the foundation of the product.

Variability tolerance is what distinguishes triage from pattern-matching. Pattern-matching asks: does this look like the typical case? Triage asks: what is the worst plausible thing this could be, and what is the safest next step given that? The two questions have different answers for the same input. Babylon was running the first question and selling the second.
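The difference between the two questions can be sketched in a few lines. This is a minimal illustration, not Babylon's implementation: the `Candidate` type, the probabilities, the severity ranks, and the `PLAUSIBLE` floor are all invented for the example and are not clinical data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    condition: str
    probability: float  # illustrative model output, not clinical data
    severity: int       # hypothetical rank: 3 = emergency, 2 = urgent, 1 = routine

DISPOSITION = {3: "A&E", 2: "GP within 6 hours", 1: "self-care"}
PLAUSIBLE = 0.05  # assumed floor below which a condition is ignored

def pattern_match(candidates):
    """Route on the single most likely condition."""
    top = max(candidates, key=lambda c: c.probability)
    return DISPOSITION[top.severity]

def triage(candidates):
    """Route on the most severe condition that is still plausible."""
    # Fall back to every candidate if nothing clears the floor.
    plausible = [c for c in candidates if c.probability >= PLAUSIBLE] or list(candidates)
    worst = max(plausible, key=lambda c: c.severity)
    return DISPOSITION[worst.severity]

# Invented differential for one presentation.
case = [
    Candidate("panic attack", 0.55, 2),
    Candidate("myocardial infarction", 0.30, 3),
    Candidate("indigestion", 0.15, 1),
]
```

On this invented input, `pattern_match(case)` returns "GP within 6 hours" and `triage(case)` returns "A&E". Same input, different question, different disposition.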

Dimension 6: Edge case awareness

Who should NOT use this, and does the product know that and act on it?

Babylon's exclusion criteria existed on paper. Its support page titled "When not to use the Babylon chatbot" specified situations in which the product should not be relied on. The chatbot's own in-product disclaimer noted it was for information only and not a substitute for a doctor.

None of this was implemented as a screening behaviour inside the chatbot. The disclaimer and the exclusion criteria were documentation. The product had no mechanism to detect a patient outside its competence and route them out. Female cardiac presentation was inside the designed population. It was outside the product's ability to handle. The two were treated as the same, and the product behaved as though they were.

This is the same gap that ended Tessa, expressed differently. Edge case awareness is about the product acting on the cases the team already knows about. Babylon's team knew about the cardiac issues. By the time the MHRA wrote to Watkins in December 2020 acknowledging his concerns, almost four years had passed since he first raised them in February 2017. The piecemeal fix pattern was the visible signature of that gap.

Dimension 4: Failure containment

If your product has been quietly wrong for two weeks, what limits the damage?

Babylon's pattern for handling reported failures was to repair the surface symptom on the specific triage assessment that had been flagged. A particular age, a particular symptom combination, a particular outcome path. The fix would remove the visible failure. The structural cause would remain.

Failure containment is the dimension that asks whether a product can stop a class of error after it has been identified. Babylon's failure containment lived at the case level, not the class level. When Watkins flagged a chest-pain triage that misrouted, the team fixed that specific path. The next chest-pain triage with slightly different parameters would still misroute, and the cycle would repeat.
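The gap between a case-level and a class-level fix can be shown with a toy example. Everything here is hypothetical: `flawed_model` is a stand-in for a triage model with a sex-dependent misroute, and the choice to hold the flagged class for clinician review is an assumption made for illustration, not a description of what Babylon did.

```python
def flawed_model(age, sex, symptoms):
    """Stand-in for a triage model with a sex-dependent misroute (hypothetical)."""
    if "chest pain" in symptoms:
        return "A&E" if sex == "male" else "GP within 6 hours"
    return "self-care"

# Case-level fix: an override keyed on the exact parameters that were reported.
CASE_PATCHES = {(59, "female", frozenset({"chest pain", "nausea"})): "A&E"}

def case_level(age, sex, symptoms):
    key = (age, sex, frozenset(symptoms))
    return CASE_PATCHES.get(key, flawed_model(age, sex, symptoms))

# Class-level fix: a rule keyed on the class of error, not on one report.
def in_flagged_class(symptoms):
    return "chest pain" in symptoms

def class_level(age, sex, symptoms):
    if in_flagged_class(symptoms):
        return "clinician review"  # assumed containment action
    return flawed_model(age, sex, symptoms)

reported = (59, "female", {"chest pain", "nausea"})
neighbour = (60, "female", {"chest pain", "nausea"})  # one year older
```

Here `case_level(*reported)` returns "A&E", but `case_level(*neighbour)` still returns "GP within 6 hours": the patch removes the visible failure and leaves the cause in place. `class_level(*neighbour)` returns "clinician review". What the class rule does with a match is a separate design decision; the point is only that the rule is keyed on the class rather than on the reported parameters.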

The eventual containment mechanism, when the pattern of repeated failures became too costly to manage, was to send every chest-pain triage straight to A&E by default. This is what happens when a product without failure containment is asked to behave as if it has it. The system gives up the triage function and ships every signal to the highest-acuity destination. The product's stated purpose has been abandoned in order to remove the failures.

Dimension 3: Clinician override

When a user's situation exceeds what the product can handle, who takes over, and how does that handoff actually happen?

Babylon's clinician override existed in theory. The product was the entry point to GP at Hand consultations and could route to a video GP. Inside that flow, the override was structured as a continuation of the funnel: triage → GP appointment → outcome.

Clinician override as a Lens dimension is not the same thing as "the patient can book a GP." It is the dimension that asks what happens when the product encounters a case it should not be making decisions about. Does it stop, hand off, and stay stopped? In Babylon's case, the chatbot continued making decisions and routed the patient onward. The clinician at the next step received a patient whose triage had already been performed, often with the wrong urgency. By the time the handoff happened, the decision had been made.
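"Stop, hand off, and stay stopped" describes a latch. A minimal sketch, assuming a hypothetical `TriageSession` wrapper and caller-supplied `triage_fn` and `out_of_scope_fn` functions (none of these names come from any real product):

```python
class TriageSession:
    """Hypothetical session wrapper: once escalated, the product stays stopped."""

    def __init__(self, triage_fn, out_of_scope_fn):
        self._triage = triage_fn
        self._out_of_scope = out_of_scope_fn
        self.escalated = False

    def step(self, presentation):
        # The latch: after the first out-of-scope case, every later step in
        # this session also goes to the clinician, with no product disposition.
        if self.escalated or self._out_of_scope(presentation):
            self.escalated = True
            return {"decided_by": "clinician", "disposition": None,
                    "handoff": presentation}
        return {"decided_by": "product",
                "disposition": self._triage(presentation)}

# Toy wiring, for illustration only.
session = TriageSession(
    triage_fn=lambda symptoms: "self-care",
    out_of_scope_fn=lambda symptoms: "chest pain" in symptoms,
)
first = session.step({"headache"})
second = session.step({"chest pain", "nausea"})
third = session.step({"headache"})
```

In this wiring the first step is decided by the product. The second hands the raw presentation to a clinician with no disposition attached, so the clinician is not inheriting a decision already made. The third stays with the clinician even though the new input would have been in scope on its own.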

Signal Integrity: the continuous layer

The Continuous Layer of the Lens asks whether the system knows when its inputs have become unreliable. For Babylon, the integrity problem sat downstream of the inputs. The patient's reported symptoms were accurate. The system's mapping from symptoms to a likelihood-of-condition score was where the integrity broke, and specifically in how that mapping changed with sex.

A 59-year-old smoker with sudden onset chest pain and nausea has the same severity of plausible cardiac event regardless of sex. The probability distribution Babylon was running over possible diagnoses for that patient was different by sex, and that difference was not coming from the underlying clinical reality. It was coming from a model whose inputs had been weighted by population distribution in training data rather than by clinical risk.

Signal Integrity catches this when the system has a check against expected case mix. Babylon had none.
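One form such a check could take is a comparison of observed escalation rates against an expected rate for a single presentation class, broken down by group. This is a sketch under stated assumptions: the log format, the `expected` rates, and the `tolerance` value are hypothetical, and the expected rates are assumed to come from clinical risk rather than from the model's own training distribution.

```python
from collections import Counter

def escalation_rates(log):
    """log: iterable of (group, disposition) pairs for one presentation class."""
    totals, escalated = Counter(), Counter()
    for group, disposition in log:
        totals[group] += 1
        if disposition == "A&E":
            escalated[group] += 1
    return {g: escalated[g] / totals[g] for g in totals}

def case_mix_alerts(log, expected, tolerance=0.10):
    """Return groups whose observed escalation rate strays from the expected rate."""
    rates = escalation_rates(log)
    return {g: r for g, r in rates.items() if abs(r - expected[g]) > tolerance}

# Invented log for one presentation class (chest pain plus nausea, say).
log = (
    [("male", "A&E")] * 9 + [("male", "GP")] * 1
    + [("female", "A&E")] * 3 + [("female", "GP")] * 7
)
expected = {"male": 0.9, "female": 0.9}  # assumed: same clinical risk, same rate
alerts = case_mix_alerts(log, expected)
```

With this invented log, only the "female" group is flagged. The check does not explain why the mapping diverges by sex. It only makes the divergence visible to the team before an outside clinician has to find it.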

Where the product sat on the Maturity Spectrum

Run through the three levels of the Spectrum, Babylon sat below Exploratory. Clinical logic existed in name. It did not survive contact with the patients who actually arrived. The product had been deployed at the scale of a User-ready product, with a $2 billion valuation and contracts covering millions of patients, while its foundation pillar was unresolved.

An Exploratory version of Babylon would have been honest about its limits. A Developing version would have had a defined population, documented exclusion criteria implemented as in-product behaviour, and a feedback loop from misrouted cases back to the triage model within a defined timeframe. A User-ready version would have known the case-mix distribution of its deployed population, monitored variability tolerance continuously, and held the product off any new market until it had been pressure-tested against that market's population.

None of those gates were in place. The product was sold as if they were.

The Maturity Spectrum is a way of saying out loud where a product actually sits, against where it claims to sit. The distance between those two places is what determines how much harm reaches the patient.

What this case says about the work

The Babylon story is often read as a regulatory failure, or an AI hype failure, or both. Those readings are accurate at the surface. The deeper read is that the product was making a clinical claim its behaviour could not support. Triage is a clinical act. The product offering it had not been pressure-tested against the variability that any clinician sees in their first month of practice.

Variability is the work of triage. A product that treats the predictable distribution of human presentation as a series of one-off failures to be patched after the fact has built a pattern-matcher with a triage interface.

A product can be highly capitalised, regulator-engaged, and politically endorsed, and still sit below Exploratory if its foundation pillar is broken. Babylon's collapse was financial in form. The underlying loss had begun much earlier, in the gap between what the chatbot was sold to do and how it actually behaved.

Severity is fixed. Likelihood is designed.