LLMs failed 80% of clinical diagnoses. We called it 'reasoning' anyway.
AI diagnostic tools fail 80% of the time in early clinical reasoning, while 'enhanced' surgical systems lead to strokes. Is the FDA authorizing faster than it can monitor?
The FDA's regulatory math is operating in a state of cognitive dissonance. As of early 2026, the agency has authorized 1,357 medical devices that use artificial intelligence, a figure that has doubled in just four years, according to Reuters and FDA records. On paper, this is a triumph of modernization; in the operating room, it looks like Erin Ralph. Ralph, a patient who underwent what was marketed as "AI-enhanced" sinus surgery, suffered a stroke when the navigation software misidentified her anatomy. Her story is not a statistical outlier but the predictable output of a system where marketing terminology has outpaced algorithmic reliability.
The rapid integration of AI into clinical environments, from diagnostic LLMs to surgical navigation software, has created a measurable safety deficit: state-of-the-art LLMs fail at differential diagnosis 80% of the time in early clinical reasoning, and AI-enabled devices are recalled in their first year at higher rates than traditional medical hardware. We are witnessing a "safety lag" in which the velocity of authorization has decoupled from the reality of patient outcomes, effectively turning the American hospital system into a high-stakes beta test for software that confidently hallucinates the location of a patient's carotid artery.
What happened: The 80% reasoning gap, from the chatbot to the operating room
In April 2026, researchers at Mass General Brigham released a study in JAMA Network Open that should have sent a chill through every hospital boardroom. Using a tool called PrIME-LLM, a standardized framework developed to evaluate performance across the four stages of clinical reasoning (the stepwise cognitive process healthcare professionals use to evaluate patient data), the team tested the latest "off-the-shelf" models, including GPT-5 and Claude 4.5. The results were catastrophic for the "AI-as-doctor" narrative: the models failed to produce an appropriate differential diagnosis more than 80% of the time in the early stages of patient evaluation.
For the uninitiated, a differential diagnosis is the clinical process of identifying a specific condition by differentiating it from others with similar clinical presentations. It is the "reasoning" part of the job. Marc Succi, a co-author of the study, noted that while these models are excellent at summarizing existing notes, they are fundamentally "not ready for unsupervised clinical-grade deployment," as reported by Euronews. The models consistently failed to narrow down the possible causes of a patient's symptoms, instead opting for broad, often dangerous generalizations.
The 80% failure rate specifically applies to the "early clinical reasoning" stage, where a physician must weigh competing hypotheses. In these scenarios, AI models often "lock in" on a single, incorrect path—a digital version of confirmation bias.
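To make that failure mode concrete, here is a minimal Python sketch. This is not how PrIME-LLM scores models, and every diagnosis and probability below is invented; it simply contrasts a clinician-style Bayesian update, which keeps the whole differential alive as evidence arrives, with a "lock-in" strategy that commits to the initial front-runner and never looks back.

```python
# Toy illustration (not the PrIME-LLM methodology): why "locking in" early fails.
# A proper differential keeps every hypothesis alive and updates it as evidence
# arrives; a lock-in strategy commits to the first leader and never revisits.

# Invented priors over three diagnoses for a hypothetical chest-pain patient.
priors = {"GERD": 0.5, "angina": 0.3, "pulmonary embolism": 0.2}

# Invented likelihood of each new finding under each diagnosis.
likelihoods = {
    "pain worse on exertion": {"GERD": 0.2, "angina": 0.8, "pulmonary embolism": 0.4},
    "normal troponin":        {"GERD": 0.9, "angina": 0.3, "pulmonary embolism": 0.7},
    "sudden-onset dyspnea":   {"GERD": 0.1, "angina": 0.3, "pulmonary embolism": 0.9},
}

def update(posterior, finding):
    """One Bayesian update: multiply by the finding's likelihoods, renormalize."""
    unnorm = {dx: p * likelihoods[finding][dx] for dx, p in posterior.items()}
    total = sum(unnorm.values())
    return {dx: p / total for dx, p in unnorm.items()}

posterior = dict(priors)
locked_in = max(priors, key=priors.get)  # lock-in: commit to the initial leader

for finding in likelihoods:
    posterior = update(posterior, finding)

ranked = sorted(posterior, key=posterior.get, reverse=True)
print("locked-in diagnosis:", locked_in)   # GERD, never revisited
print("updated differential:", ranked)     # leader changes as evidence accumulates
```

Run the sketch and the lock-in strategy dies on the prior: it commits to reflux while the updated differential, weighing the same three findings, ends with pulmonary embolism on top. That is the digital confirmation bias the study describes, reduced to thirty lines.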
This algorithmic incompetence isn't limited to chat interfaces; it has physically entered the body. Since the integration of AI components into the TruDi Navigation System in 2021, the FDA has logged over 100 malfunctions and adverse events, including 10 serious injuries, according to an investigation by Reuters. Patients like Donna Fernihough and Erin Ralph suffered strokes and cerebrospinal fluid leaks because the AI-enhanced system allegedly misinformed surgeons about the location of their instruments. A lawsuit filed in Texas alleges that the TruDi system was "arguably safer before integrating changes in the software to incorporate artificial intelligence."
The "user confusion" defense and the first-year recall problem
Manufacturers and defenders of these systems argue that adverse event reports do not prove causality and that failures are often "user confusion" or display issues rather than core algorithmic flaws. Integra LifeSciences, the maker of the TruDi system, stated to Reuters that such reports "do nothing more than indicate that a TruDi system was in use in a surgery where an adverse event took place." Similarly, Medtronic attributed heart monitor failures to "user confusion" regarding how the AI displayed abnormal rhythms, rather than the algorithm itself missing the events.
However, data from Johns Hopkins and Yale suggests that this "user error" defense is a convenient shield for premature releases. Research published in JAMA Health Forum found that 43% of recalls of AI medical devices occur within the first year of authorization. That timeframe suggests devices are being pushed into clinical settings before their real-world behavior is understood. Whether a patient suffers because an algorithm miscalculated or because the AI's interface was so counter-intuitive that a surgeon was "confused" into a mistake is a distinction without a difference for the person on the table. If nearly half of a carmaker's recalls hit vehicles within twelve months of leaving the factory, the problem isn't the drivers; it's the engineering.
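It is worth being precise about what that statistic measures. The sketch below uses invented records, not the JAMA Health Forum dataset; it shows that the 43% is the share of recalls landing within 365 days of a device's authorization date, not the share of all devices recalled.

```python
# Hypothetical sketch of the first-year recall metric (made-up records,
# not the JAMA Health Forum dataset): the share of recalls issued within
# 365 days of the device's FDA authorization date.
from datetime import date

recalls = [  # (authorization date, recall date): invented examples
    (date(2023, 3, 1),  date(2023, 9, 15)),   # recalled after ~6 months
    (date(2022, 6, 10), date(2025, 1, 20)),   # recalled after ~2.5 years
    (date(2024, 1, 5),  date(2024, 11, 30)),  # recalled after ~11 months
]

within_first_year = sum(
    (recalled - authorized).days <= 365 for authorized, recalled in recalls
)
print(f"{within_first_year / len(recalls):.0%} of recalls fell in year one")
```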
Why it matters: The regulatory vacuum and the illusion of accuracy
The surge in AI medical devices is occurring precisely as the FDA’s oversight capacity is being hollowed out. Between 2022 and 2026, the number of AI authorizations doubled, yet the agency has struggled with staffing in its Division of Imaging, Diagnostics, and Software Reliability (DIDSR). This has created what critics call a "Fast-Track Trap."
Manufacturers often use "final diagnosis" accuracy as a proxy for clinical safety in their marketing materials. An AI might boast 95% accuracy in identifying a specific lung nodule on a static scan, but that figure says nothing about the clinical reasoning required to decide whether the nodule warrants a risky biopsy. By treating the reasoning process as a solved problem, the industry ignores data like OECD Incident 7970, which shows systematic failures in the early, messy stages of patient triage.
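A toy calculation makes the proxy problem visible. The stage accuracies below are invented for illustration, and the chained figure assumes the stages fail independently, but the shape of the problem holds: a headline final-diagnosis score can sit on top of a collapsed reasoning pipeline.

```python
# Invented numbers showing why "final diagnosis accuracy" is a poor proxy
# for clinical safety: a model can score well once confirmatory test results
# are in hand while failing the earlier reasoning stages that decide which
# risky tests to order in the first place.

# Per-stage accuracy for a hypothetical model (illustrative values only).
stage_accuracy = {
    "triage / problem framing":      0.35,
    "differential diagnosis":        0.20,  # the ~80% early failure mode
    "diagnostic test selection":     0.40,
    "final diagnosis (tests given)": 0.95,  # the number in the brochure
}

# Safety depends on every stage holding up, not on the last one alone.
end_to_end = 1.0
for stage, acc in stage_accuracy.items():
    end_to_end *= acc
    print(f"{stage:32s} {acc:5.0%}")

print(f"\nmarketed accuracy:   {stage_accuracy['final diagnosis (tests given)']:.0%}")
print(f"chained reliability: {end_to_end:.1%}")  # assumes independent stages
```

Under these made-up numbers, a device marketed on its 95% final-stage score delivers a chained reliability under 3%. The absolute figures are fiction; the gap between the brochure number and the pipeline number is the point.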
The FDA's current authorization loop relies heavily on manufacturer self-reporting. Without independent post-market surveillance, the true number of AI-related surgical malfunctions is likely significantly higher than the 100+ cases currently documented.
This regulatory vacuum is further complicated by the use of "black box" models. When Samsung Medison’s Sonio Detect AI was reported to the FDA for misidentifying fetal body parts in 2025, the response was largely bureaucratic. The "unpredictability" of the algorithm was framed as a technical hurdle rather than a disqualifying safety risk.
What's next: The move toward 'Clinical-Grade' standards
To move beyond the era of "beta-testing on patients," the medical community is calling for a radical shift in standards. The PrIME-LLM benchmark is a start, but it needs to become a requirement, not a voluntary research tool. Marc Succi and others at Mass General Brigham are advocating for "unsupervised clinical-grade" requirements that would force AI developers to prove their models can handle the complexity of a differential diagnosis before they ever touch a patient record.
Furthermore, there is a growing demand for longer observation periods before full surgical automation is permitted. The fact that 43% of recalls land within a device's first year shows that the current "fast-track" authorization process is failing to catch algorithmic drift and interface hazards. Patients deserve to know when an algorithm is "steering the drill," and they deserve a regulatory body that doesn't lose its best experts to the very companies it is supposed to be monitoring.
Conclusion: Beta-testing the human brain
The evidence presented, from the 80% failure rate in LLM-based differential diagnosis documented in JAMA Network Open to the triple-digit malfunction logs of the TruDi system, strongly supports the thesis that we are in a safety deficit. The marketing of "clinical reasoning" in AI is currently a misnomer; what is being sold is sophisticated pattern matching that breaks under the pressure of real-world clinical ambiguity.
The "safety lag" is real, and the cost is being paid in strokes and cerebrospinal fluid leaks. Until "clinical-grade" standards are enforced and the FDA closes the authorization loop with rigorous, independent post-market surveillance, the term "AI-enhanced" will remain a warning rather than a feature. We called it "reasoning," but in 80% of early-stage evaluations it was just a very expensive, very confident mistake.