A comparative analysis of the performance of large language models in the basic life support exam: comprehensive evaluation of ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1
dc.contributor.author | Bulut, Bensu | |
dc.contributor.author | Öz, Medine Akkan | |
dc.contributor.author | Genç, Murat | |
dc.contributor.author | Gür, Ayşenur | |
dc.contributor.author | Yortanlı, Mehmet | |
dc.contributor.author | Yortanlı, Betül Çiğdem | |
dc.contributor.author | Yazıcı, Ramiz | |
dc.contributor.author | Mutlu, Hüseyin | |
dc.contributor.author | Kotanoğlu, Mustafa Sırrı | |
dc.contributor.author | Çınar, Eray | |
dc.date.accessioned | 2025-09-23T12:19:05Z | |
dc.date.available | 2025-09-23T12:19:05Z | |
dc.date.issued | 2025 | |
dc.department | Faculty of Medicine | |
dc.description.abstract | Considering the growing role that artificial intelligence technologies play in medical education, this study provides a comparative evaluation of the performance of the large language models ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1 on the Basic Life Support (BLS) exam. Materials and Methods: In this observational study, we presented the four large language models with 25 multiple-choice questions based on the American Heart Association (AHA) guidelines. The questions were divided into two categories: knowledge-based (n = 14, 56%) and case-based (n = 11, 44%). To assess response consistency, each question was presented to every model on three separate days. Model accuracy was evaluated using overall accuracy, strict accuracy, and ideal accuracy criteria. Results: In the overall accuracy assessment, ChatGPT-4o and DeepSeek R1 achieved 100% accuracy, while Gemini 2.0 and Claude 3.5 achieved 96%. All models performed perfectly on the case-based questions. On the knowledge-based questions, ChatGPT-4o and DeepSeek R1 again scored full marks, while Gemini 2.0 and Claude 3.5 reached 90.9%. Statistical analysis showed no significant difference among the models (p = 0.368). Discussion: Large language models achieve high accuracy rates on BLS exam content. These technologies can play a supportive role in medical education, but human supervision remains critical in clinical decision-making. | |
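The abstract reports accuracy criteria (overall, strict, and ideal) computed from three repeated presentations of each question, plus a non-significant between-model comparison (p = 0.368). Below is a minimal Python sketch of how such metrics and a comparison test might be computed; it is not the authors' pipeline. The per-trial outcomes are hypothetical, the definitions of "overall" (correct on at least one trial) and "strict" (correct on all three trials) accuracy are assumptions, and the abstract does not name the statistical test used, so the chi-square test here is only one plausible choice.

from scipy.stats import chi2_contingency

# Hypothetical per-question outcomes: for each model, one
# (trial1, trial2, trial3) tuple of booleans per question, matching the
# 25-question exam and the 100% / 96% overall accuracies in the abstract.
results = {
    "ChatGPT-4o":  [(True, True, True)] * 25,
    "DeepSeek R1": [(True, True, True)] * 25,
    "Gemini 2.0":  [(True, True, True)] * 24 + [(False, False, False)],
    "Claude 3.5":  [(True, True, True)] * 24 + [(False, False, False)],
}

def overall_accuracy(trials):
    # Assumed definition: correct on at least one of the three trials.
    return sum(any(t) for t in trials) / len(trials)

def strict_accuracy(trials):
    # Assumed definition: correct on every one of the three trials.
    return sum(all(t) for t in trials) / len(trials)

for model, trials in results.items():
    print(f"{model}: overall={overall_accuracy(trials):.0%}, "
          f"strict={strict_accuracy(trials):.0%}")

# 4 x 2 contingency table of correct vs. incorrect question counts per
# model (by the assumed overall criterion), compared with a chi-square
# test; the paper reports p = 0.368 but does not name its test.
table = [
    [sum(any(t) for t in trials), sum(not any(t) for t in trials)]
    for trials in results.values()
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, p = {p:.3f}")

With perfectly consistent hypothetical data like this, overall and strict accuracy coincide; in general the strict criterion is the harder one, since a single inconsistent trial disqualifies a question.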
dc.identifier.doi | 10.4328/ACAM.22758 | |
dc.identifier.endpage | 581 | |
dc.identifier.issn | 2667-663X | |
dc.identifier.issue | 8 | |
dc.identifier.startpage | 578 | |
dc.identifier.uri | https://doi.org/10.4328/ACAM.22758 | |
dc.identifier.uri | https://hdl.handle.net/20.500.12451/14527 | |
dc.identifier.volume | 16 | |
dc.identifier.wos | WOS:001544294200010 | |
dc.identifier.wosquality | Q4 | |
dc.indekslendigikaynak | Web of Science | |
dc.institutionauthor | Mutlu, Hüseyin | |
dc.language.iso | en | |
dc.publisher | Bayrakol Medical Publisher | |
dc.relation.ispartof | Annals of Clinical and Analytical Medicine | |
dc.relation.publicationcategory | Article - International Peer-Reviewed Journal - Institutional Faculty Member | |
dc.rights | info:eu-repo/semantics/closedAccess | |
dc.subject | Artificial Intelligence | |
dc.subject | Large Language Models | |
dc.subject | Basic Life Support | |
dc.subject | Medical Education | |
dc.subject | ChatGPT | |
dc.subject | Resuscitation | |
dc.title | A comparative analysis of the performance of large language models in the basic life support exam: comprehensive evaluation of ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1 | |
dc.type | Article |