Optimizing Diagnostic Performance of ChatGPT: The Impact of Prompt Engineering on Thoracic Radiology Cases

dc.contributor.author: Cesur, Turay
dc.contributor.author: Gunes, Yasin Celal
dc.date.accessioned: 2025-01-21T16:43:16Z
dc.date.available: 2025-01-21T16:43:16Z
dc.date.issued: 2024
dc.department: Kırıkkale Üniversitesi
dc.description.abstract: Background: Recent studies have highlighted the diagnostic performance of ChatGPT 3.5 and GPT-4 in a text-based format, demonstrating their radiological knowledge across different areas. Our objective is to investigate the impact of prompt engineering on the diagnostic performance of ChatGPT 3.5 and GPT-4 in diagnosing thoracic radiology cases, highlighting how prompt complexity influences model performance. Methodology: We conducted a retrospective cross-sectional study using 124 publicly available Case of the Month examples from the Thoracic Society of Radiology website. We initially input the cases into the ChatGPT versions without prompting. We then employed five different prompts, ranging from basic task-oriented to complex role-specific formulations, to measure the diagnostic accuracy of the ChatGPT versions. The differential diagnosis lists generated by the models were compared against the radiological diagnoses listed on the Thoracic Society of Radiology website, with a scoring system in place to comprehensively assess accuracy. Diagnostic accuracy and differential diagnosis scores were analyzed using the McNemar, Chi-square, Kruskal-Wallis, and Mann-Whitney U tests. Results: Without any prompts, ChatGPT 3.5's accuracy was 25% (31/124), which increased to 56.5% (70/124) with the most complex prompt (P < 0.001). GPT-4 showed a high baseline accuracy of 53.2% (66/124) without prompting. This accuracy increased to 59.7% (74/124) with complex prompts (P = 0.09). Notably, there was no statistically significant difference in peak performance between ChatGPT 3.5 (70/124) and GPT-4 (74/124) (P = 0.55). Conclusions: This study emphasizes the critical influence of prompt engineering on enhancing the diagnostic performance of ChatGPT versions, especially ChatGPT 3.5.
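The Results rest on a paired design: the same 124 cases are scored with and without a prompt, which is why McNemar's test (rather than an unpaired test) is used for the within-model comparison. Below is a minimal Python sketch of that comparison for ChatGPT 3.5; it is not the authors' code, and the interior cells of the 2x2 table are hypothetical, chosen only to be consistent with the marginals the abstract reports (31/124 correct without a prompt, 70/124 with the most complex prompt).

    # McNemar's test on per-case correctness of ChatGPT 3.5: no prompt vs. the
    # most complex prompt. NOTE: interior cell counts are hypothetical; only the
    # marginals (31/124 and 70/124) come from the abstract.
    from statsmodels.stats.contingency_tables import mcnemar

    # Rows: no-prompt correct / incorrect; columns: complex-prompt correct / incorrect.
    contingency = [
        [30, 1],   # hypothetical: 30 correct in both settings, 1 lost with the prompt
        [40, 53],  # hypothetical: 40 gained with the prompt, 53 missed in both
    ]

    result = mcnemar(contingency, exact=True)  # exact binomial test on the discordant pairs
    print(f"McNemar p-value: {result.pvalue:.2e}")

Under these assumed counts the discordant pairs are 1 and 40, so the exact p-value falls far below 0.001; indeed, any cell assignment consistent with the reported 31-versus-70 marginals keeps the two-sided p-value below 0.001, in line with the P < 0.001 the abstract reports.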
dc.identifier.doi: 10.7759/cureus.60009
dc.identifier.issn: 2168-8184
dc.identifier.issue: 5
dc.identifier.pmid: 38854352
dc.identifier.uri: https://doi.org/10.7759/cureus.60009
dc.identifier.uri: https://hdl.handle.net/20.500.12587/25223
dc.identifier.volume: 16
dc.identifier.wos: WOS:001236292800029
dc.identifier.wosquality: N/A
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: PubMed
dc.language.iso: en
dc.publisher: Springer Nature
dc.relation.ispartof: Cureus Journal of Medical Science
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.snmz: KA_20241229
dc.subject: prompt engineering; radiology; large language models; gpt-4; chat generative pre-trained transformer (chatgpt)
dc.title: Optimizing Diagnostic Performance of ChatGPT: The Impact of Prompt Engineering on Thoracic Radiology Cases
dc.type: Article
