New frontiers in radiologic interpretation: evaluating the effectiveness of large language models in pneumothorax diagnosis
dc.authorid | 0000-0002-5629-3143 | |
dc.authorid | 0000-0002-9521-1120 | |
dc.contributor.author | Bulut, Bensu | |
dc.contributor.author | Akkan Öz, Medine | |
dc.contributor.author | Genç, Murat | |
dc.contributor.author | Gür, Ayşenur | |
dc.contributor.author | Yortanlı, Mehmet | |
dc.contributor.author | Yortanlı, Betül Çiğdem | |
dc.contributor.author | Sarıyıldız, Oğuz | |
dc.contributor.author | Yazıcı, Ramiz | |
dc.contributor.author | Mutlu, Hüseyin | |
dc.contributor.author | Kotanoğlu, Mustafa Sırrı | |
dc.contributor.author | Çınar, Eray | |
dc.contributor.author | Uykan, Zekeriya | |
dc.date.accessioned | 2025-09-24T13:17:58Z | |
dc.date.available | 2025-09-24T13:17:58Z | |
dc.date.issued | 2025 | |
dc.department | Faculty of Medicine | |
dc.description.abstract | Background: This study evaluates the diagnostic performance of three multimodal large language models (LLMs), ChatGPT-4o, Gemini 2.0, and Claude 3.5, in identifying pneumothorax from chest radiographs. Methods: In this retrospective analysis, 172 pneumothorax cases (148 patients aged >12 years, 24 patients aged ≤12 years) with both chest radiographs and confirmatory thoracic CT were included from a tertiary emergency department. Patients were categorized by age and pneumothorax size (small/large). Each radiograph was presented to all three LLMs accompanied by basic symptoms (dyspnea or chest pain), with each model analyzing each image three times. Diagnostic accuracy was evaluated using overall accuracy (all three responses correct), strict accuracy (≥2 responses correct), and ideal accuracy (≥1 response correct), alongside response consistency assessment using Fleiss' Kappa. Results: In patients older than 12 years, ChatGPT-4o demonstrated the highest overall accuracy (69.6%), followed by Claude 3.5 (64.9%) and Gemini 2.0 (57.4%). Performance was significantly poorer in pediatric patients across all models (20.8%, 12.5%, and 20.8%, respectively). For large pneumothorax in adults, ChatGPT-4o showed significantly higher accuracy compared to small pneumothorax (81.6% vs. 42.2%; p < 0.001). Regarding consistency, Gemini 2.0 demonstrated excellent reliability for large pneumothorax (Kappa = 1.00), while Claude 3.5 showed moderate consistency across both pneumothorax sizes. Conclusion: This study, the first to evaluate these three current multimodal LLMs in pneumothorax identification across different age groups, demonstrates promising results for potential clinical applications, particularly for adult patients with large pneumothorax. However, performance limitations in pediatric cases and with small pneumothoraces highlight the need for further validation before clinical implementation. | |
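The three accuracy definitions in the abstract (overall: all three responses correct; strict: at least two correct; ideal: at least one correct) can be illustrated with a minimal sketch. This is not the study's code; the function name and the example data are hypothetical, assuming each case is recorded as three boolean correctness judgments.

```python
# Illustrative sketch (not from the paper): the abstract's three
# accuracy definitions, computed over repeated model responses.
# Each case is a 3-tuple of booleans: was each of the three
# responses for that radiograph correct?

def accuracy_rates(cases):
    """Return (overall, strict, ideal) accuracy as fractions.

    overall: all 3 responses correct
    strict:  at least 2 of 3 responses correct
    ideal:   at least 1 of 3 responses correct
    """
    n = len(cases)
    overall = sum(all(r) for r in cases) / n
    strict = sum(sum(r) >= 2 for r in cases) / n
    ideal = sum(any(r) for r in cases) / n
    return overall, strict, ideal

# Hypothetical example: four cases with decreasing agreement
cases = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, False, False),
]
print(accuracy_rates(cases))  # (0.25, 0.5, 0.75)
```

By construction, overall ≤ strict ≤ ideal for any dataset, which is why the abstract reports them as progressively more lenient criteria.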
dc.identifier.doi | 10.1371/journal.pone.0331962 | |
dc.identifier.issue | 9 | |
dc.identifier.uri | https://doi.org/10.1371/journal.pone.0331962 | |
dc.identifier.uri | https://hdl.handle.net/20.500.12451/14539 | |
dc.identifier.volume | 20 | |
dc.indekslendigikaynak | PubMed | |
dc.institutionauthor | Mutlu, Hüseyin | |
dc.institutionauthorid | 0000-0002-1930-3293 | |
dc.language.iso | en | |
dc.publisher | PLOS One | |
dc.relation.ispartof | PLOS One | |
dc.relation.publicationcategory | Article - International Peer-Reviewed Journal - Institutional Faculty Member | |
dc.rights | info:eu-repo/semantics/closedAccess | |
dc.title | New frontiers in radiologic interpretation: evaluating the effectiveness of large language models in pneumothorax diagnosis | |
dc.type | Article |