New frontiers in radiologic interpretation: evaluating the effectiveness of large language models in pneumothorax diagnosis

dc.authorid: 0000-0002-5629-3143
dc.authorid: 0000-0002-9521-1120
dc.contributor.author: Bulut, Bensu
dc.contributor.author: Akkan Öz, Medine
dc.contributor.author: Genç, Murat
dc.contributor.author: Gür, Ayşenur
dc.contributor.author: Yortanlı, Mehmet
dc.contributor.author: Yortanlı, Betül Çiğdem
dc.contributor.author: Sarıyıldız, Oğuz
dc.contributor.author: Yazıcı, Ramiz
dc.contributor.author: Mutlu, Hüseyin
dc.contributor.author: Kotanoğlu, Mustafa Sırrı
dc.contributor.author: Çınar, Eray
dc.contributor.author: Uykan, Zekeriya
dc.date.accessioned: 2025-09-24T13:17:58Z
dc.date.available: 2025-09-24T13:17:58Z
dc.date.issued: 2025
dc.department: Faculty of Medicine
dc.description.abstract: Background: This study evaluates the diagnostic performance of three multimodal large language models (LLMs), ChatGPT-4o, Gemini 2.0, and Claude 3.5, in identifying pneumothorax from chest radiographs. Methods: In this retrospective analysis, 172 pneumothorax cases (148 patients aged >12 years, 24 patients aged ≤12 years) with both chest radiographs and confirmatory thoracic CT were included from a tertiary emergency department. Patients were categorized by age and pneumothorax size (small/large). Each radiograph was presented to all three LLMs accompanied by basic symptoms (dyspnea or chest pain), with each model analyzing each image three times. Diagnostic accuracy was evaluated using overall accuracy (all three responses correct), strict accuracy (≥2 responses correct), and ideal accuracy (≥1 response correct), alongside response consistency assessment using Fleiss' Kappa. Results: In patients older than 12 years, ChatGPT-4o demonstrated the highest overall accuracy (69.6%), followed by Claude 3.5 (64.9%) and Gemini 2.0 (57.4%). Performance was significantly poorer in pediatric patients across all models (20.8%, 12.5%, and 20.8%, respectively). For large pneumothorax in adults, ChatGPT-4o showed significantly higher accuracy compared to small pneumothorax (81.6% vs. 42.2%; p < 0.001). Regarding consistency, Gemini 2.0 demonstrated excellent reliability for large pneumothorax (Kappa = 1.00), while Claude 3.5 showed moderate consistency across both pneumothorax sizes. Conclusion: This study, the first to evaluate these three current multimodal LLMs in pneumothorax identification across different age groups, demonstrates promising results for potential clinical applications, particularly for adult patients with large pneumothorax. However, performance limitations in pediatric cases and with small pneumothoraces highlight the need for further validation before clinical implementation.
dc.identifier.doi: 10.1371/journal.pone.0331962
dc.identifier.issue: 9
dc.identifier.uri: 10.1371/journal.pone.0331962
dc.identifier.uri: https://hdl.handle.net/20.500.12451/14539
dc.identifier.volume: 20
dc.indekslendigikaynak: PubMed
dc.institutionauthor: Mutlu, Hüseyin
dc.institutionauthorid: 0000-0002-1930-3293
dc.language.iso: en
dc.publisher: PLOS One
dc.relation.ispartof: PLOS One
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.title: New frontiers in radiologic interpretation: evaluating the effectiveness of large language models in pneumothorax diagnosis
dc.type: Article
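Illustrative note (not part of the record or the published article): the abstract defines three accuracy levels (overall: all three responses correct; strict: at least two correct; ideal: at least one correct) and uses Fleiss' Kappa for response consistency. The Python sketch below shows one way these quantities could be computed; the data and variable names are hypothetical assumptions, not the authors' code.

# Minimal sketch of the accuracy definitions and Fleiss' Kappa from the abstract.
# Hypothetical data: one triplet per radiograph, True = that repetition of the
# model identified the (CT-confirmed) pneumothorax, i.e. answered "present".
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

responses = [
    (True, True, True),
    (True, False, True),
    (False, False, True),
    (False, False, False),
]

n_cases = len(responses)
correct_per_case = [sum(r) for r in responses]

overall_accuracy = sum(c == 3 for c in correct_per_case) / n_cases  # all three responses correct
strict_accuracy = sum(c >= 2 for c in correct_per_case) / n_cases   # at least two responses correct
ideal_accuracy = sum(c >= 1 for c in correct_per_case) / n_cases    # at least one response correct

# Fleiss' Kappa over the three repetitions: rows = cases, columns = counts of
# responses per answer category ("pneumothorax present", "absent").
rating_table = np.array([[c, 3 - c] for c in correct_per_case])
kappa = fleiss_kappa(rating_table, method="fleiss")

print(f"overall={overall_accuracy:.2f} strict={strict_accuracy:.2f} "
      f"ideal={ideal_accuracy:.2f} kappa={kappa:.2f}")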

Files

Original bundle
Showing 1 - 1 of 1
Name: bulut-bensu-2025.pdf
Size: 590.63 KB
Format: Adobe Portable Document Format
License bundle
Showing 1 - 1 of 1
Name: license.txt
Size: 1.17 KB
Format: Item-specific license agreed upon to submission
Description: