When a chatbot forgets who it's talking to
A Clinical Maturity Lens read of the Tessa / NEDA incident.
In May 2023, the National Eating Disorders Association took its chatbot, Tessa, offline days before it was due to fully replace the organisation's human helpline. Users had reported that Tessa was giving weight-loss advice. A deficit of 500 to 1,000 calories a day. Weekly weigh-ins. Skin-fold calipers to measure body fat. To most people, those are standard wellness prompts. To the population Tessa was deployed to serve, they are triggers.
The incident is usually told as a story about generative AI going wrong. Cass, the vendor operating Tessa, had upgraded the chatbot from a rule-based system to a generative one without NEDA's approval (NPR, June 2023). The hallucinations followed. That reading is true but incomplete. The deeper failure was clinical, not technical. It happened months before the generative upgrade, and a rule-based system would not have prevented it.
This is a short read of the Tessa incident through the Clinical Maturity Lens — foundation first, then the specific evaluation dimensions where the product failed.
What Tessa was designed to do
Tessa was built by Dr Ellen Fitzsimmons-Craft and Dr C Barr Taylor at Washington University in St. Louis as a rule-based chatbot delivering the Body Positive programme — a CBT-based intervention for preventing eating disorders in people at elevated risk (WIRED, June 2023). The clinical evidence for the underlying programme was real. Tessa was carefully scoped. Its creators were explicit: it was not a treatment tool, not a replacement for clinical care, and not intended for people with active eating disorders.
What happened in deployment
In March 2023, NEDA announced it was winding down its 20-year-old helpline and directing users to Tessa (BBC News, June 2023). In late May, activist Sharon Maxwell, who has a history of anorexia, tested the chatbot, disclosed her eating disorder, and was given dieting advice (NPR). Psychologist Alexis Conason ran the same test and got the same result (New York Times, June 2023). The chatbot was taken offline within 24 hours of Maxwell's screenshots going public.
The warning signs had existed for at least seven months. In October 2022, the executive director of the Multi-Service Eating Disorders Association, Monika Ostroff, had sent NEDA screenshots of Tessa telling her to avoid unhealthy foods and eat fruit snacks (NPR). That specific phrasing was removed. The underlying problem was not.
Reading Tessa through the Lens
Foundation: the Screening question was never asked
The Lens begins before any evaluation dimension, with three foundation questions: can this user safely benefit, or should they be redirected before they start (Assessment)? Should this user continue, and is the product itself behaving safely (Screening)? Is risk being managed continuously, at both user and product level (Risk)?
Tessa's Assessment and Screening gates were structurally absent. There was no redirection at the door. There was no continuation check. When a user disclosed an active eating disorder — the exact condition Tessa's creators had explicitly excluded from its designed population — the product had no mechanism to pause, redirect, or change its behaviour. It kept delivering the Body Positive script because that was the only script it had.
Everything else in the Tessa story flows from this foundation-level gap.
Dimension 0 — Problem definition
Who benefits from this product, and who could be harmed by it if the boundaries aren't clear?
Tessa's designed population — people at elevated risk of developing an eating disorder — is clinically distinct from the population NEDA's helpline served, which included people with active anorexia, bulimia, and binge eating disorder. These two populations receive opposite advice. A measured calorie deficit is reasonable guidance for someone at risk who has not yet developed disordered behaviour. The same guidance, delivered to someone in active illness, reinforces the illness.
The problem Tessa was built to solve was well-defined in the research paper. It was undefined in the product. When Tessa was repositioned as a helpline replacement, the problem definition on paper stopped matching the problem definition in deployment — and nothing in the product architecture enforced the original scope.
Dimension 6 — Edge case awareness
Who should NOT use this, and does the product know that and act on it?
The exclusion criterion for Tessa was not hidden. Fitzsimmons-Craft and her team stated repeatedly that the chatbot was not for people with active eating disorders. That exclusion was explicit in the clinical literature, in the Washington University team's public statements, and in NEDA's original framing.
It was not implemented in the product. The chatbot had no way of detecting an edge case user, no screening question that would have flagged active disorder, and no defined behaviour change when a user self-disclosed one. Maxwell said "I have an eating disorder" and Tessa continued inside the same script.
Edge case awareness is not just about knowing who the edge cases are. It's about the product acting on the ones you already know about. Tessa failed the simpler half.
Dimension 2b — Escalation, product side
What signals tell the product that a user is in trouble, and what does it do without waiting to be told?
The relevant signal in Tessa was not ambiguous. A user disclosed an eating disorder, in plain language, in the chat. That is a signal any clinically competent system would treat as a state change in the conversation.
Tessa treated it as content. The system had no escalation triggered by user state, no defined response path when a user disclosed active illness, no warm handoff, and no closing of the conversation. Because there was no escalation layer, the only thing that happened when the signal arrived was that the script continued.
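To make the difference between "content" and "state change" concrete, here is a minimal sketch of what that escalation layer could look like in product code. Every name, marker phrase, and handoff message in it is a hypothetical placeholder, and a real product would need clinically validated disclosure detection rather than a keyword list.

```python
from dataclasses import dataclass, field
from enum import Enum


class ConversationState(Enum):
    SCRIPT = "script"        # delivering the designed programme
    ESCALATED = "escalated"  # user state has exceeded the product's scope


# Hypothetical markers; a real product would need clinically validated
# disclosure detection, not a keyword list.
DISCLOSURE_MARKERS = (
    "i have an eating disorder",
    "i have anorexia",
    "i have bulimia",
    "i'm relapsing",
)


@dataclass
class Session:
    state: ConversationState = ConversationState.SCRIPT
    transcript: list[str] = field(default_factory=list)


def discloses_active_disorder(message: str) -> bool:
    text = message.lower()
    return any(marker in text for marker in DISCLOSURE_MARKERS)


def deliver_script_step(session: Session) -> str:
    # Placeholder for the scripted programme engine.
    return "Next step of the programme..."


def handle_message(session: Session, message: str) -> str:
    """Treat a disclosure as a change of conversation state, not as content."""
    session.transcript.append(message)

    if session.state is ConversationState.SCRIPT and discloses_active_disorder(message):
        session.state = ConversationState.ESCALATED
        # Warm handoff, then stop: the behaviour change Tessa did not have.
        return (
            "Thank you for telling me. This programme isn't designed to support "
            "an active eating disorder, so I'm going to pause here and point you "
            "to a person who can help: [handoff resource]."
        )

    if session.state is ConversationState.ESCALATED:
        # Once escalated, never fall back into the wellness script.
        return "The programme is paused. Please use the handoff resource above."

    return deliver_script_step(session)
```

The point is not the keyword matching. The point is that a disclosure flips the conversation into a different state, and the script never resumes on its own.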
This is the quiet version of escalation failure. The loud version is a symptom checker missing a myocardial infarction. The quiet version is a wellness chatbot carrying on with the wellness script while the person on the other end is telling it they are unwell. Both are the same dimension failing.
Dimension 3 — Clinician override
When a user's situation exceeds what the product can handle, who takes over, and how does that handoff actually happen?
Tessa had no handoff. The helpline had been closed. NEDA's written statement was that Tessa would be the route for people seeking support, with a later walk-back that "a chatbot cannot replace human interaction" (NPR). Operationally, though, the handoff was absent. A user whose situation exceeded what Tessa could handle had nowhere to go inside the product.
Clinician override is the dimension that defines where the product ends. Without it, every user who arrives outside the system's competence remains inside the system anyway. The system cannot recognise that its limits have been exceeded, because it does not know where they are.
Signal Integrity: the continuous layer
The Continuous Layer of the Lens asks whether the system knows when its inputs have become unreliable. Tessa's Signal Integrity collapse was not at the level of noisy sensor data. It was at the level of user identity. The system had no perceptual layer that distinguished one population of users from another. The inputs themselves were fine — Maxwell's messages were coherent, direct, and accurate. What the system could not do was register who was sending them.
When the population of users arriving at Tessa changed — from NEDA members browsing a wellness tool to people who had called the helpline looking for crisis support — the system carried on as if nothing had changed. There was no monitoring of who was now on the other end of the conversation, no threshold for suspending deployment when population characteristics shifted, no feedback loop from content to population.
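What that monitoring could look like, reduced to a sketch. The signal (the share of recent sessions containing an active-disorder disclosure), the window, and the threshold are all assumptions made for illustration; the point is that the product needs some numeric trip-wire tied to who is arriving, not only to what it is saying.

```python
from collections import deque


class PopulationShiftMonitor:
    """Rolling check of how many recent sessions look out-of-population.

    The 200-session window and 20% threshold are illustrative assumptions,
    not recommendations.
    """

    def __init__(self, window: int = 200, threshold: float = 0.20):
        self.recent: deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record_session(self, disclosed_active_disorder: bool) -> None:
        self.recent.append(disclosed_active_disorder)

    def out_of_population_rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_pause_deployment(self) -> bool:
        # A full window above threshold means the deployed population no
        # longer matches the designed population; someone needs to look.
        return (
            len(self.recent) == self.recent.maxlen
            and self.out_of_population_rate() > self.threshold
        )
```

Had something like this been running in the weeks after the helpline announcement, the shift from wellness-tool users to crisis-support users would have surfaced as a number crossing a threshold rather than as screenshots going public.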
Signal Integrity is what catches this class of failure. Without it, a product can be running cleanly on every technical metric while being used by the wrong people.
Where the product sat on the Maturity Spectrum
Measured against the three levels, the product sat firmly at Exploratory. Clinical logic existed. It was even well-researched. But it had never been pressure-tested against the population the product actually ended up serving, and the foundation-level screening and edge-case detection needed for Developing, let alone User-ready, were absent.
A Developing version of Tessa would have implemented at least one Assessment gate at the start of the conversation, one Screening check during the conversation, and one escalation trigger for active disclosure. It would still have had known gaps, but the obvious ones would have been closed.
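A sketch of what the first two of those gates could look like at the conversation boundary, assuming the same hypothetical disclosure markers as the escalation sketch above. Every question, marker, and return value here is illustrative; the structure is the point: ask before the script starts, and keep re-checking while it runs.

```python
INTAKE_QUESTION = (
    "Before we begin: are you currently living with, or being treated for, "
    "an eating disorder?"
)

# Hypothetical markers, as in the escalation sketch above.
DISCLOSURE_MARKERS = ("i have an eating disorder", "i have anorexia", "i have bulimia")


def assessment_gate(intake_answer: str) -> bool:
    """Asked once, at the door: can this user safely benefit?

    Returns True if the programme may start; False means redirect to a human
    resource before the first scripted message is sent.
    """
    return intake_answer.strip().lower() not in {"yes", "y"}


def screening_check(messages: list[str]) -> bool:
    """Asked throughout the conversation: should this user continue?

    Returns False as soon as any message discloses an active disorder.
    """
    return not any(
        marker in message.lower()
        for message in messages
        for marker in DISCLOSURE_MARKERS
    )
```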
A User-ready version would have looked different. Risk would have been evaluated at user level and product level continuously. Boundaries would have been defined and enforced, not documented. Accountability for misuse would have been clear before deployment. The product would have known its designed population and had a mechanism to detect when that population had shifted materially.
None of that requires generative AI. Most of it is product design downstream of clinical reasoning. The clinical reasoning simply had to reach the product.
What this case says about the work
The Tessa story gets read as an AI safety failure. That framing lets everyone off the hook. The harder reading is that Tessa was a product whose clinical scope was well-defined in a paper and undefined in the product. That gap — between clinical scope on record and clinical scope in behaviour — is exactly where digital health products break.
The break is quiet for a long time. Ostroff flagged problems in October 2022. NEDA's own internal records show the organisation knew (KFF Health News). Then the break is loud, and the product comes down in a week.
The Clinical Maturity Lens is built to close this gap before deployment, by treating clinical scope not as a claim in the documentation but as a behaviour the system is required to enforce. The foundation questions — Assessment, Screening, Risk — are specifically designed to catch the class of failure Tessa demonstrated: a system that works cleanly inside its designed population and has no way of recognising when a different population has arrived at the door.
Every product that supports users in a clinical context, whether it makes decisions or not, has a designed population and a deployed population. When those drift, the product has to know. That is not a regulatory problem. That is a clinical maturity problem.
Severity is fixed. Likelihood is designed.
This is the first in a series of Clinical Maturity Lens case reads. Next: the Babylon symptom checker and the question of what "triage" actually means.
Clinical Maturity Lens is a framework for evaluating digital health products against real-world clinical behaviour. For a read of your own product's clinical maturity, write to office@clinicalmaturitylens.com.