What the study evaluated
The study examined how well artificial intelligence detects fractures on musculoskeletal X-rays compared with radiologists. A total of 600 pediatric and adult radiographs were collected, and two ground truth readers established the final “ground truth” diagnosis for 548 of them. The diagnostic performance of the AI system Carebot AI Bones and of four radiologists with varying levels of experience was then directly compared.
Study results in clinical practice
The AI system showed higher overall sensitivity than all four radiologists, meaning it missed fewer fractures, particularly in body regions where fractures are more common. Radiologists, on the other hand, were generally more conservative and achieved higher specificity, resulting in fewer false-positive findings. In practice, this suggests that AI can serve as an effective safety net against missed fractures, while final clinical decisions remain with the doctor.
Key numbers
AI sensitivity: 88% (fewer missed fractures)
AI specificity: 88%
Radiologists’ sensitivity: approximately 70–83%
Radiologists’ specificity: approximately 96–99% (very few false positives)
Fracture detection using radiography is crucial for effective patient management. Despite advances, missed fractures remain a significant issue. This study evaluates the diagnostic performance of a deep learning model versus radiologists in identifying fractures on musculoskeletal X-rays. We collected a study sample (n_SAMPLE) of 600 pediatric and adult radiographs. The images were retrospectively read by two ground truth readers, by four radiologists with varying experience in a multi-reader study, and by an AI model (Carebot AI Bones 1.2.2, Carebot s.r.o.). The ground truth was reached for 548 images (n_GT), including 95 fracture cases (n_FRACTURE) and 453 normal cases (n_NORMAL). The AI system achieved a sensitivity (Se) of 0.884 (0.804–0.934) and a specificity (Sp) of 0.879 (0.845–0.906). In comparison, the radiologists' sensitivity ranged from 0.695 (0.596–0.778) to 0.832 (0.744–0.894) and their specificity ranged from 0.962 (0.941–0.976) to 0.993 (0.981–0.998). The AI model outperformed radiologists in Se across various body parts, particularly in areas with higher fracture prevalence, while showing comparable Sp in some categories. This study highlights the potential of AI to enhance diagnostic accuracy in clinical practice.
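As a rough illustration of how figures like these are derived, the sketch below computes sensitivity, specificity, and a confidence interval from raw counts. The abstract reports only rates, so the counts of correctly classified cases (84 of 95 fractures, 398 of 453 normal images) are back-calculated assumptions. The abstract also does not state how its intervals were computed; the Wilson score interval used here is one common choice, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Case counts reported in the abstract.
n_fracture = 95   # n_FRACTURE
n_normal = 453    # n_NORMAL

# ASSUMPTION: these counts are not in the source. They are back-calculated
# from the reported Se of 0.884 and Sp of 0.879 for the AI model.
true_positives = 84
true_negatives = 398

# Sensitivity: share of fracture cases that were flagged as fractures.
sensitivity = true_positives / n_fracture
# Specificity: share of normal cases that were correctly left unflagged.
specificity = true_negatives / n_normal

se_lo, se_hi = wilson_interval(true_positives, n_fracture)
sp_lo, sp_hi = wilson_interval(true_negatives, n_normal)

print(f"Se = {sensitivity:.3f} ({se_lo:.3f} to {se_hi:.3f})")
print(f"Sp = {specificity:.3f} ({sp_lo:.3f} to {sp_hi:.3f})")
```

The interval for sensitivity is wider than the one for specificity because it rests on 95 fracture cases rather than 453 normal ones, which is worth keeping in mind when comparing readers on sensitivity.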




