A comparative analysis of the performance of large language models in the basic life support exam: comprehensive evaluation of ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1

dc.contributor.author: Bulut, Bensu
dc.contributor.author: Öz, Medine Akkan
dc.contributor.author: Genç, Murat
dc.contributor.author: Gür, Ayşenur
dc.contributor.author: Yortanlı, Mehmet
dc.contributor.author: Yortanlı, Betül Çiğdem
dc.contributor.author: Yazıcı, Ramiz
dc.contributor.author: Mutlu, Hüseyin
dc.contributor.author: Kotanoğlu, Mustafa Sırrı
dc.contributor.author: Çınar, Eray
dc.date.accessioned: 2025-09-23T12:19:05Z
dc.date.available: 2025-09-23T12:19:05Z
dc.date.issued: 2025
dc.department: Faculty of Medicine
dc.description.abstract: Considering the growing role artificial intelligence technologies play in medical education, this study provides a comparative evaluation of the performance of the large language models ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1 on the Basic Life Support (BLS) Exam. Materials and Methods: In this observational study, we presented 25 multiple-choice questions based on the American Heart Association (AHA) guidelines to four large language models. The questions were divided into two categories: knowledge-based (n = 14, 56%) and case-based (n = 11, 44%). To assess response consistency, each question was presented to all models on three separate days. Accuracy was assessed using overall, strict, and ideal accuracy criteria. Results: In the overall accuracy assessment, ChatGPT-4o and DeepSeek R1 achieved 100% success, while Gemini 2.0 and Claude 3.5 achieved 96%. All models answered the case-based questions perfectly. On the knowledge-based questions, ChatGPT-4o and DeepSeek R1 scored full marks, while Gemini 2.0 and Claude 3.5 achieved 90.9%. Statistical analysis showed no significant difference between the models (p = 0.368). Discussion: Large language models achieve high accuracy rates on BLS material. These technologies can serve supportive roles in medical education, but human supervision remains critical in clinical decision-making.
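The record does not define how the three accuracy criteria were computed or which statistical test produced p = 0.368. The minimal Python sketch below scores triplicate responses under assumed definitions (overall = share of all individual responses that are correct; strict = a question counts only if correct on all three days; ideal = correct on a majority of days) and runs an illustrative chi-square comparison; every definition and name here is an assumption, not the authors' method.

from scipy.stats import chi2_contingency

# responses[model][question] = [day1, day2, day3] correctness flags (bool).
def score(responses):
    results = {}
    for model, questions in responses.items():
        n_q = len(questions)                          # questions per model
        n_r = sum(len(days) for days in questions)    # individual responses
        results[model] = {
            # Assumed: fraction of all individual responses that are correct.
            "overall": sum(sum(days) for days in questions) / n_r,
            # Assumed: a question counts only if correct on all three days.
            "strict": sum(all(days) for days in questions) / n_q,
            # Assumed: a question counts if correct on a majority of days.
            "ideal": sum(sum(days) >= 2 for days in questions) / n_q,
        }
    return results

# Illustrative 2x4 test on correct/incorrect question counts per model,
# using the reported 100%/100%/96%/96% overall rates on 25 items; the
# paper's actual test is not specified in this record.
table = [[25, 0], [25, 0], [24, 1], [24, 1]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")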
dc.identifier.doi: 10.4328/ACAM.22758
dc.identifier.endpage: 581
dc.identifier.issn: 2667-663X
dc.identifier.issue: 8
dc.identifier.startpage: 578
dc.identifier.uri: https://doi.org/10.4328/ACAM.22758
dc.identifier.uri: https://hdl.handle.net/20.500.12451/14527
dc.identifier.volume: 16
dc.identifier.wos: WOS:001544294200010
dc.identifier.wosquality: Q4
dc.indekslendigikaynak: Web of Science
dc.institutionauthor: Mutlu, Hüseyin
dc.language.iso: en
dc.publisher: Bayrakol Medical Publisher
dc.relation.ispartof: Annals of Clinical and Analytical Medicine
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.subject: Artificial Intelligence
dc.subject: Large Language Models
dc.subject: Basic Life Support
dc.subject: Medical Education
dc.subject: ChatGPT
dc.subject: Resuscitation
dc.title: A comparative analysis of the performance of large language models in the basic life support exam: comprehensive evaluation of ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1
dc.type: Article

Files

Original bundle
Showing 1 - 1 of 1
Name: bulut-bensu-2025.pdf
Size: 293.83 KB
Format: Adobe Portable Document Format

License bundle
Showing 1 - 1 of 1
Name: license.txt
Size: 1.17 KB
Format: Item-specific license agreed upon to submission
Description: