AI Therapy Clinical Evidence Scorecard

Most AI therapy tools claim to be "clinically validated" or "evidence-based." But what does that actually mean? We graded every platform in our database by the quality and quantity of its published clinical evidence. The disparity is striking.

Published: April 4, 2026 | Last updated: April 4, 2026

The Big 2025 Development: Therabot — The First RCT of a Generative-AI Therapy Chatbot

In March 2025, Dartmouth researchers published the field's first randomized controlled trial of a generative-AI therapy chatbot in NEJM AI. The study (Heinz et al., 2025) randomized 210 adults with clinically significant symptoms of major depressive disorder, generalized anxiety disorder, or clinically high risk for feeding/eating disorders to either a 4-week Therabot intervention (N=106) or a waitlist control (N=104). The intervention group showed symptom reductions of approximately 51% (depression), 31% (anxiety), and 19% (eating disorders), with participants reporting a therapeutic alliance with Therabot comparable to working with a human clinician. Effect sizes were broadly comparable to RCTs of in-person CBT delivered over roughly twice the contact time. The result was meaningful enough to be covered in MIT Technology Review and STAT News as a watershed moment for the category.

Two caveats matter. First, Therabot is not publicly available: it is a research instrument developed at Dartmouth, not a consumer product, and larger-sample replication and head-to-head comparison against existing in-person treatment are still needed before clinicians or consumers treat the result as definitive. Second, Therabot's safety profile in the trial was strong, but the trial excluded participants at high suicide risk and clinical oversight was active throughout, so safety in unmoderated, real-world consumer deployment remains an open question.

Our Evidence Rating System

Gold: 5+ randomized controlled trials (RCTs) with control groups, published in peer-reviewed journals
Silver: 1-4 peer-reviewed studies or large observational data with meaningful effect sizes
Bronze: Internal data, third-party reviews, or validated assessment instruments used (but no independent RCTs about the platform itself)
Unrated: No published clinical evidence about the platform's effectiveness
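The four tiers above can be read as a simple decision rule. The sketch below is only an illustration of the thresholds as written, not the scorecard's actual grading process. The `EvidenceProfile` fields and the `rate` function are hypothetical names introduced here, and treating any number of non-RCT peer-reviewed studies as Silver is an assumption, since the rubric does not say how to grade a platform with many papers but fewer than five RCTs.

```python
from dataclasses import dataclass


@dataclass
class EvidenceProfile:
    """Hypothetical summary of one platform's published evidence."""
    peer_reviewed_rcts: int = 0            # controlled RCTs in peer-reviewed journals
    other_peer_reviewed_studies: int = 0   # peer-reviewed studies that are not RCTs
    large_observational_data: bool = False # large observational data with meaningful effect sizes
    weaker_evidence: bool = False          # internal data, third-party reviews, or validated instruments


def rate(profile: EvidenceProfile) -> str:
    """Map an evidence profile to a tier, checking the strongest tier first."""
    # Gold: 5+ peer-reviewed RCTs with control groups.
    if profile.peer_reviewed_rcts >= 5:
        return "Gold"
    # Silver: at least one peer-reviewed study, or large observational data.
    # Assumption: no upper bound is applied to the count of non-RCT studies.
    studies = profile.peer_reviewed_rcts + profile.other_peer_reviewed_studies
    if studies >= 1 or profile.large_observational_data:
        return "Silver"
    # Bronze: only internal data, third-party reviews, or validated instruments.
    if profile.weaker_evidence:
        return "Bronze"
    # Unrated: no published clinical evidence at all.
    return "Unrated"
```

Under these thresholds, a hypothetical profile with four peer-reviewed RCTs returns Silver, and one with only internal outcomes data returns Bronze.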

| Platform | Audience | Evidence Rating | Published Research | Regulatory Status |
| --- | --- | --- | --- | --- |
| Therabot (Dartmouth) | Research (not publicly available) | Gold | First-ever RCT of a generative-AI therapy chatbot: Heinz et al., NEJM AI, 2025 (N=210) | FDA Breakthrough Device designation (March 2026) |
| Wysa | B2C | Gold | 30+ peer-reviewed papers including Inkster et al., JMIR mHealth, 2018 (real-world evaluation in NHS-deployed users) | CE-marked Class I (EU), NHS-endorsed (UK) |
| Woebot | B2B (consumer app shut down) | Gold | Multiple peer-reviewed RCTs; most validated AI chatbot globally; flagship study Fitzpatrick et al., JMIR Mental Health, 2017 (N=70 college students, significant PHQ-9 reduction) | FDA pathway attempted (cited as reason for B2C exit) |
| Lyssn | B2B | Gold | 60+ peer-reviewed publications, 17+ years research | HIPAA compliant |
| Youper | B2C | Silver | Mehta et al., JMIR, 2021: longitudinal observational study (N=4,517), anxiety d=0.57 and depression d=0.46 over 2 weeks. Not an RCT. | None |
| Elomia | B2B+B2C | Silver | Active clinical trial (NCT06725147); BMC Psychology study | None |
| MindDoc | B2C | Bronze | Uses validated instruments (PHQ-9, GAD-7); no platform-specific RCTs | EU Class I medical device |
| Talkiatry | B2C | Bronze | Internal outcomes data: 86% of patients report feeling better within 2 visits | None (licensed psychiatrists, not the platform) |
| Replika | B2C | Bronze | Mixed: Maples et al., 2024 reported 3% of 1,006 users credited Replika with halting suicidal ideation; critical Matters Arising response flagged self-selection bias and harm documentation. FTC complaint outstanding. | None (not a clinical tool) |
| Bloom | B2C (discontinued) | Unrated | Content by licensed therapists; no independent studies | Discontinued Feb 2025 |
| Blueprint | B2B | Unrated | Measurement-based care validated; no AI-specific studies | HIPAA compliant |
| Mentalyc | B2B | Unrated | UC Berkeley-backed; SOC 2 Type II; no clinical studies | SOC 2 Type II, HIPAA |
| Upheal | B2B | Unrated | No published research; user reviews only | HIPAA + BAA |
| Freed | B2B | Unrated | No mental health-specific studies; general scribe | HIPAA, SOC 2 |
| Alma | B2B+B2C | Unrated | No published research | BBB accredited |
| SimplePractice | B2B | Unrated | No AI-specific research; most-used therapy EHR | HIPAA + BAA, HITRUST |

Key Takeaways

Consumer apps have more evidence than therapist tools. Wysa (30+ papers) and Woebot (multiple peer-reviewed RCTs) invested heavily in clinical validation. None of the B2B AI note tools (Upheal, Mentalyc, Blueprint, SimplePractice, Freed) have published peer-reviewed studies about their AI accuracy.
Lyssn is the B2B exception. With 60+ peer-reviewed publications and 17+ years of research, Lyssn has the strongest evidence base of any B2B tool — but it's a training/QI tool, not a documentation tool.
"Clinically validated" is often marketing. Many platforms claim clinical validation without peer-reviewed RCTs. Using validated assessment instruments (PHQ-9, GAD-7) is not the same as validating the platform itself.
Regulatory status varies dramatically. Wysa has a CE mark and NHS endorsement. MindDoc has EU Class I classification. Most other platforms have no regulatory approval of any kind.
Security compliance is not clinical evidence. HIPAA compliance, BAA availability, and SOC 2 certification are separate from clinical effectiveness research. A tool can be secure without being clinically validated.

Why This Matters

For a YMYL (Your Money or Your Life) topic like mental health, the distinction between evidence-based and evidence-free matters. Consumers choosing therapy apps deserve to know that Wysa has 30+ papers while Replika has an FTC complaint. Therapists choosing AI documentation tools deserve to know that none of those tools has a published peer-reviewed accuracy study.

This scorecard is updated quarterly. If a platform publishes new clinical research, we will update its rating. Read our full methodology for how we evaluate clinical evidence.
