How does AI perform compared to human expert panels in medical Delphi studies? A pilot study through the lens of pathology.

Item Type

Journal Article

Abstract

Background: Delphi studies have long been a fixture of the medical literature. In a Delphi study, an expert panel works toward consensus on questions for which objective results are difficult or impossible to obtain, with responses ranked on a Likert scale. Whether artificial intelligence (AI), particularly large language models (LLMs), can perform the role traditionally assigned to the expert panel has scarcely been explored in medicine. This study therefore aimed to explore the feasibility of an "AI-run" Delphi study applied to the practice of pathology.

Methods: A prior human-based Delphi study (PMID: 36603288) that forecast the future role of AI in pathology was repeated with LLMs in place of the expert panel (Llama 3, ChatGPT-4, and ChatGPT-3.5, chosen based on availability at the time of the study). Each model was run at three temperature settings (0, 0.7, and 1.0). Temperature is a parameter that controls how strongly an LLM favors determinism over creativity: low temperatures make the output more deterministic and focused, whereas high temperatures make it more creative. "Delphi-GPT" was created to automate the prompts, which entailed 5 trials for 180 questions, and the resulting data were compared with those of the original human expert panel.

Findings: Every LLM and temperature combination reached consensus on a greater percentage of the 180 questions than the human experts did. The newer ChatGPT-4 and Llama 3 models performed better than ChatGPT-3.5. Although AI models and human experts did not always agree, agreement increased with the temperature setting across all LLMs.

Interpretation: These results show that LLMs can simulate a Delphi study in medicine. The generative AI models consistently reached greater degrees of consensus than human experts in their responses to 180 prompts related to the future practice of pathology. This serves as a proof of concept that, pending further robust methodological validation, AI could one day serve as a surrogate for de novo Delphi studies that would ordinarily rely on feedback from a panel of experts. The reliability of the consensus/concordance achieved will depend on the combination of LLM and temperature setting selected.
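
The role of temperature described in the Methods can be illustrated with a minimal Python sketch of temperature-scaled softmax, the standard way a temperature value reshapes a model's output distribution. This is not the study's code. The function name `softmax_with_temperature` and the scores in `likert_logits` are hypothetical, and treating a temperature of 0 as a purely greedy choice is an assumption about how a provider might handle that setting.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 is treated as greedy selection,
        # so all probability mass goes to the highest-scoring option.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for Likert responses 1 through 5 (illustrative only).
likert_logits = [0.5, 1.0, 2.0, 3.0, 1.5]

# The three temperature settings named in the abstract.
for t in (0, 0.7, 1.0):
    probs = softmax_with_temperature(likert_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Under these assumptions, lower temperatures concentrate probability on the top-scoring response, while higher temperatures spread it across more options, so repeated trials vary more.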

Journal

Journal of Pathology Informatics
