New frontiers in radiologic interpretation: evaluating the effectiveness of large language models in pneumothorax diagnosis

Date

2025

Publisher

PLOS One

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Background: This study evaluates the diagnostic performance of three multimodal large language models (LLMs), ChatGPT-4o, Gemini 2.0, and Claude 3.5, in identifying pneumothorax from chest radiographs.

Methods: In this retrospective analysis, 172 pneumothorax cases (148 patients aged >12 years, 24 patients aged ≤12 years) with both chest radiographs and confirmatory thoracic CT were included from a tertiary emergency department. Patients were categorized by age and pneumothorax size (small/large). Each radiograph was presented to all three LLMs together with basic symptoms (dyspnea or chest pain), and each model analyzed each image three times. Diagnostic accuracy was evaluated using overall accuracy (all three responses correct), strict accuracy (≥2 responses correct), and ideal accuracy (≥1 response correct). Response consistency was assessed using Fleiss' Kappa.

Results: In patients older than 12 years, ChatGPT-4o demonstrated the highest overall accuracy (69.6%), followed by Claude 3.5 (64.9%) and Gemini 2.0 (57.4%). Performance was significantly poorer in pediatric patients across all models (20.8%, 12.5%, and 20.8%, respectively). In adults, ChatGPT-4o was significantly more accurate for large pneumothorax than for small pneumothorax (81.6% vs. 42.2%; p < 0.001). Regarding consistency, Gemini 2.0 showed perfect agreement across repeated responses for large pneumothorax (Kappa = 1.00), while Claude 3.5 showed moderate consistency across both pneumothorax sizes.

Conclusion: This study, the first to evaluate these three current multimodal LLMs in pneumothorax identification across different age groups, demonstrates promising results for potential clinical applications, particularly for adult patients with large pneumothorax. However, performance limitations in pediatric cases and with small pneumothoraces highlight the need for further validation before clinical implementation.
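The three accuracy tiers and the consistency statistic named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code. It assumes each case is stored as a list of three booleans (one per repeated response, `True` meaning the response was correct) and that Fleiss' Kappa is computed over that binary correct/incorrect coding, which the abstract does not specify. The function names `accuracy_tiers` and `fleiss_kappa` and the toy data are hypothetical.

```python
import math
from typing import Dict, List


def accuracy_tiers(runs: List[List[bool]]) -> Dict[str, float]:
    """Share of cases meeting each accuracy threshold, as defined in the abstract.

    runs[i] holds the correctness of each repeated response for case i
    (three responses per case in the study design).
    """
    n_cases = len(runs)
    n_runs = len(runs[0])
    correct = [sum(case) for case in runs]  # correct responses per case
    return {
        "overall": sum(c == n_runs for c in correct) / n_cases,  # all correct
        "strict": sum(c >= 2 for c in correct) / n_cases,        # >=2 correct
        "ideal": sum(c >= 1 for c in correct) / n_cases,         # >=1 correct
    }


def fleiss_kappa(runs: List[List[bool]]) -> float:
    """Fleiss' Kappa for two categories, treating repeated runs as raters.

    Assumption: categories are correct/incorrect. Returns NaN when every
    response falls into one category, because chance agreement is then 1
    and the statistic is undefined.
    """
    n_cases = len(runs)
    n = len(runs[0])  # ratings per case
    counts = [(sum(case), n - sum(case)) for case in runs]

    # Mean observed agreement across cases.
    p_bar = sum((a * a + b * b - n) / (n * (n - 1)) for a, b in counts) / n_cases

    # Chance agreement from the marginal category proportions.
    p_true = sum(a for a, _ in counts) / (n_cases * n)
    p_e = p_true ** 2 + (1 - p_true) ** 2

    if math.isclose(p_e, 1.0):
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Toy data: four hypothetical cases, three repeated responses each.
toy_runs = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
    [False, False, False],
]
tiers = accuracy_tiers(toy_runs)
kappa = fleiss_kappa(toy_runs)
```

On the toy data the tiers are nested (overall ≤ strict ≤ ideal), which follows directly from the thresholds. A Kappa of 1.00, as reported for Gemini 2.0 on large pneumothorax, corresponds under this coding to every case receiving three identical responses while outcomes still differ between cases.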

Source

PLOS One

Volume

20

Issue

9
