Last winter, a head nurse in our psychiatric unit confided to me, “The dashboard says we’re low-risk. Yet during night shifts, I feel unsafe even walking to the bathroom.” The monthly quality report on her desk had said the same thing for almost a year: “Violence incidents: no significant difference among the three wards (p > .05).” On paper, her ward looked average. In reality, it was anything but.
Her unit treated more high-acuity patients, had far higher staff turnover, and used restraints more often. The problem was not the staff, and it was not the patients. It was the statistics.
**The error: treating incident counts as if they were average values**
The reassuring report stemmed from a common statistical mistake: the analyst had used ANOVA, a method designed to compare means, to analyze counts of violent incidents.
In hospitals, two fundamentally different types of numbers exist:
– **Counts:** The frequency of occurrences (20 violent incidents, 7 falls, 6 code blues).
– **Means:** The average level of something (average documentation hours, average pain scores, average blood pressure).
Counts answer “how many.” Means answer “how much.” They are not interchangeable.
In our hospital, the three wards reported the following:
– Ward A (psychiatric): 20 violence incidents
– Ward B (medical): 7 incidents
– Ward C (surgical): 6 incidents
Any clinician would see the obvious difference. But ANOVA does not see “20 vs. 7 vs. 6” the way we do. It works with averages per patient. Assuming each ward managed roughly 100 patients, the figures become:
– 0.20 incidents per patient
– 0.07 incidents per patient
– 0.06 incidents per patient
Once transformed, the striking difference shrinks into three small decimals. Because the events are rare and ANOVA was never designed for binary outcomes, the test readily concludes that the difference could be chance. The official report therefore states: no significant difference.
It’s like using a ruler to count how many cats you have. The wrong tool makes very different groups look the same. A chi-square test, which is designed for categorical counts, would almost certainly have flagged Ward A as genuinely higher risk.
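To make this concrete, here is a minimal sketch of the chi-square approach on these counts. The figures (20, 7, 6 incidents) come from the text; treating each ward as roughly 100 patients is the same assumption made above, and `scipy` is assumed to be available.

```python
# Chi-square test on the raw incident counts instead of per-patient averages.
from scipy.stats import chi2_contingency

# Rows: wards A, B, C; columns: patients with / without a violent incident,
# assuming ~100 patients per ward as in the text.
table = [
    [20, 80],  # Ward A (psychiatric)
    [7, 93],   # Ward B (medical)
    [6, 94],   # Ward C (surgical)
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")  # p well below .05
```

On these numbers the chi-square test yields p ≈ 0.002: exactly the “genuinely higher risk” signal that the averaged-out ANOVA missed.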
However, applying the incorrect method conveyed the wrong message: All wards are equivalent.
**The human impact of “no significant difference”**
Once the report was circulated, the repercussions were swift and distressing.
– Requests from the psychiatric unit for more staff were declined; leadership concluded the ward’s risk was not statistically higher.
– Concerns raised by frontline nurses were dismissed as emotional rather than evidence-based.
– Administrators trusted the p-value and believed they were acting fairly.
Meanwhile, the disparity between data and reality widened.
Nurses drew a frustrating lesson: the numbers on the slides did not reflect the environment they worked in. Some left. Those who stayed shouldered the workload and the emotional burden.
**Then the AI system arrived, trained on the same erroneous data**
Three months later, the hospital launched an AI tool aimed at predicting agitation and violence. The concept was straightforward: instruct the model using previous incidents, then identify high-risk patients.
But the AI absorbed the same statistical error that claimed all three wards presented identical risks. To the algorithm, every ward seemed alike.
The psychiatric ward soon faced an influx of alerts. Medium-risk patients were classified as high-risk, while genuinely unstable patients were sometimes overlooked. A junior nurse remarked, “When everyone is high-risk, no one is high-risk.”
Alert fatigue set in. A tool intended to enhance safety was now eroding trust.
**When AI overrides clinical judgment**
During one hectic evening, our 62-year-old attending physician reviewed the AI overlay for a newly admitted patient. The interface displayed a calm green label: low risk of agitation.
The charge nurse disagreed. She observed the patient pacing, exhibiting facial tension, and raising their voice. “I have a bad feeling about this,” she stated.
Pressed for time and influenced by the AI’s assured label, the attending sided with the model. Ten minutes later, the patient struck a resident in the face.
Afterward, the attending muttered, “Maybe I’m getting old. Maybe the AI perceives things I don’t.”
However, the AI wasn’t observing more. It was reiterating the flawed statistics it had been trained on. The damage wasn’t solely the physical injury. It was the self-doubt instilled in a clinician with years of expertise.
**A second issue: halting at ANOVA without conducting post-hoc tests**
Another error arose from a different category of analysis.
When the hospital compared average documentation time across three departments, ANOVA was the right tool, and it was used correctly: the p-value was below 0.01, indicating a genuine difference somewhere among the groups. Yet the analysis stopped there. No one asked which departments actually differed from which.
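A significant ANOVA only says *some* difference exists; a post-hoc test such as Tukey’s HSD says *which pairs* differ. A minimal sketch of that next step, using simulated documentation hours (the department labels, means, and sample sizes here are invented for illustration, not the hospital’s data):

```python
# ANOVA tells us *that* the departments differ; Tukey's HSD tells us *which* pairs do.
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

rng = np.random.default_rng(0)
# Simulated daily documentation hours per clinician (illustrative values only).
dept_a = rng.normal(loc=3.0, scale=0.5, size=30)
dept_b = rng.normal(loc=2.2, scale=0.5, size=30)
dept_c = rng.normal(loc=2.1, scale=0.5, size=30)

f_stat, p = f_oneway(dept_a, dept_b, dept_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p:.4f}")  # overall difference exists

# Post-hoc step the hospital skipped: pairwise comparisons with
# family-wise error control.
result = tukey_hsd(dept_a, dept_b, dept_c)
print(result)  # per-pair confidence intervals and p-values
```

In this simulated setup, the pairwise output would show department A driving the difference while B and C are similar, which is precisely the actionable information the hospital’s analysis never produced.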