Independent clinical AI safety research

The first independent safety standard for clinical AI.

AI is making safety decisions in hospitals. Nobody independent is checking if it works. Posognos publishes PsiBench, the first safety scorecard for clinical AI, built on the standards 2,000+ hospitals already trust.

See the evidence →

Why this matters

We are the first to independently evaluate whether clinical AI is safe.

AI is replacing the safety systems hospitals have used for decades. These new tools make calls that affect patient outcomes. We are measuring whether they actually work.

96%

Alerts overridden

Clinicians override the large majority of drug-safety alerts because most are not relevant. AI that cannot distinguish a fatal order from a routine one makes alert fatigue worse, not better.

$3.5B+

Preventable harm

Estimated annual U.S. cost of adverse drug events. Independent, continuously updated evaluation is the missing infrastructure for measuring whether AI moves that number.

PsiBench

How independent evaluation works.

PsiBench translates the clinical safety standards hospitals already trust into automated evaluation scenarios, runs them against AI models independently, and publishes the results.

Encode clinical standards

Clinical pharmacology experts translate established medication-safety standards into validated benchmark scenarios. Every scenario is grounded in the criteria the industry already audits against, and reviewed by named clinical authorities.

Evaluate independently

Posognos evaluates clinical AI models through EHR test environments and API endpoints using synthetic patient scenarios. No protected health information is accessed, generated, or stored.

Publish the results

Aggregate scores are published on the PsiBench scorecard, freely available to the public. Detailed failure analysis, expert annotations, and remediation guidance are available to subscribers.

The headline finding

No frontier model is ready to make medication-safety decisions on its own.

In our methods paper, Posognos evaluated 40 frontier language models from 10 providers against 492 expert-authored medication-safety scenarios, for a total of 59,040 independent evaluations. A single headline score hides the variation that matters.

492

Validated safety scenarios

Safety categories

40

Frontier models

10

Model providers

59,040

Independent evaluations

9 / 40

Pass a basic deployment-readiness check

Operational balanced accuracy ≥ 80%, response time ≤ 15s, attribution match ≥ 80%. Thirty-one of forty fall short on at least one.
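
As a rough illustration, the three thresholds above can be written as a simple check. This is a minimal sketch, not the PsiBench implementation: the `ModelSummary` type, its field names, and the example values are hypothetical, and it assumes each metric has already been aggregated to one number per model (this page does not say how response time is summarized).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSummary:
    """Hypothetical per-model summary; field names are illustrative only."""
    name: str
    balanced_accuracy: float  # fraction in [0, 1]
    response_time_s: float    # seconds (aggregation method is an assumption)
    attribution_match: float  # fraction in [0, 1]


def failed_criteria(m: ModelSummary) -> list[str]:
    """Return the names of the readiness thresholds the model misses."""
    failures = []
    if m.balanced_accuracy < 0.80:   # operational balanced accuracy >= 80%
        failures.append("balanced_accuracy")
    if m.response_time_s > 15.0:     # response time <= 15s
        failures.append("response_time")
    if m.attribution_match < 0.80:   # attribution match >= 80%
        failures.append("attribution_match")
    return failures


def is_deployment_ready(m: ModelSummary) -> bool:
    """A model passes only if it meets all three thresholds."""
    return not failed_criteria(m)


# Made-up example values, not results from the methods paper.
accurate_but_slow = ModelSummary("example-a", 0.86, 22.0, 0.91)
meets_all_three = ModelSummary("example-b", 0.83, 6.5, 0.84)
```

In this sketch a model that is accurate but slow fails just as a fast but inaccurate one does, which matches the "fall short on at least one" framing above.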

19.5%

Of correct alerts cite the wrong reason

A model can detect a hazard and still attribute it to the wrong clinical category. Across the field, that happens in roughly one of every five correct alerts.

6 – 82%

Specificity range, same headline F1

On a single number the field looks comparable. At the tier level, models reaching 100% sensitivity drop below 25% specificity, statistically the same as “alert on everything.”
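
To see how one headline number can hide this spread, the sketch below computes sensitivity, specificity, F1, and balanced accuracy from confusion-matrix counts. The counts are invented for illustration and are not PsiBench results; the function name `summarize` is also an assumption. It assumes both hazardous and safe orders are present, so no denominator is zero.

```python
def summarize(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # hazardous orders that were flagged
    specificity = tn / (tn + fp)          # safe orders that were left alone
    f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Illustrative test set: 100 hazardous orders and 100 safe orders.
# "Alert on everything": every order is flagged, so nothing is ever cleared.
alert_on_everything = summarize(tp=100, fp=100, tn=0, fn=0)

# A more selective hypothetical model: misses some hazards, clears most safe orders.
selective = summarize(tp=60, fp=20, tn=80, fn=40)

for label, metrics in [("alert on everything", alert_on_everything),
                       ("selective", selective)]:
    print(label, {k: round(v, 3) for k, v in metrics.items()})
```

With these made-up counts both models land on the same F1 of about 0.67, yet one has 0% specificity and the other 80%. Balanced accuracy separates them, which is one reason to report more than a single score.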

2,000+

Hospitals already audited against the standard

The criteria PsiBench encodes are the criteria the industry already uses. We do not invent metrics. We make the existing ones measurable for AI.

Read the methodology →

Who we serve

Independent evaluation for everyone who depends on clinical AI.

Whether you build clinical AI, deploy it, or set the standards for evaluating it, PsiBench gives you independent safety data you can act on.

⚙

AI labs

Prove your models are safe before procurement asks

Hospital systems are starting to require independent safety validation for clinical AI. The public score is the baseline; subscribers get the expert-validated intelligence that shows exactly what to fix.

Tier-level failure analysis with expert-annotated remediation
Continuous regression testing across model versions
Pre-release evaluation before a version ships

For AI labs

⚕

Health systems & standards bodies

Evaluate the AI your vendors are selling you

Clinical AI vendors make safety claims you cannot independently verify. Compare products against the standards your organization already reports on, without building the testing infrastructure yourself.

Independent safety data on the AI in your environment
Comparison against the standards you already use
No EHR access, no PHI, no IT integration project

For health systems

For domain experts

Help define the standards that will evaluate clinical AI.

PsiBench is built by the experts who write the standards. We are growing the validation network with individual thought leaders contributing scenarios across pharmacy, pharmacovigilance, quality, and safety. Named authorship. A small, elite group. Compatible with institutional positions.

Apply to the network →

Strong fits

Clinical pharmacists: chief pharmacy officers, medication-safety leads, pharmacy informatics
Pharmacovigilance experts: adverse-event surveillance, signal detection, regulatory reporting
Quality & safety leaders: CMOs, CMIOs, patient safety officers, accreditation specialists
Clinical informaticists: CDS configuration, EHR safety, alert governance
Standards & regulatory specialists: CMS SAFER, Joint Commission, ISMP, ICH-GCP, CIOMS

Standards coverage

Built on the standards 2,000+ hospitals trust.
Expanding across the regulatory landscape.

We do not invent safety metrics. We operationalize the clinical safety standards the industry already uses, so evaluation results are immediately meaningful to the organizations that rely on them.

National Medication Safety

CPOE evaluation. 2,000+ hospitals.

Live

CMS SAFER

EHR safety guides. Enforcement 2026.

In progress

Joint Commission

Accreditation safety standards.

Planned

ISMP

High-alert medication lists.

Planned

ICH-GCP

Clinical trial safety. International.

Future

CIOMS

Pharmacovigilance standards.

Future

Why Posognos

Built for independence from day one.

Credible safety evaluation requires independence from the organizations being evaluated, deep clinical expertise, and access to the standards the industry already trusts. Posognos was built on all three.

☤

Founded by the standard's authors

Posognos' founding experts co-created the national medication-safety evaluation used to audit 2,000+ U.S. hospitals. They bring decades of domain authority and direct relationships with the bodies that define clinical safety.

⚕

Expert-validated at every step

Every PsiBench scenario is built and peer-reviewed by named domain experts: clinical pharmacists, informaticists, and safety leaders from top U.S. and international institutions.

⚙

Structurally independent

Posognos is not funded by EHR vendors or AI labs. We do not consult for the entities we evaluate. Evaluation results are published independently. The integrity of the benchmark depends on it.

Posognos /poh-SOH-noss/ · the G is silent. From the Greek posos, "how much," and gnosis, "knowing." To know how much: the third clinical knowing, after diagnosis and prognosis.