New Delhi: As chatbots are increasingly used to make sense of symptoms or test results, a study has found that AI tools may not work as well in conversations resembling real-world interactions, even when they perform well on medical exams. The study, published in the journal Nature Medicine, also offers recommendations for evaluating large language models (LLMs), which power chatbots such as ChatGPT, before they are used in clinical settings. LLMs are trained on massive text datasets and can therefore respond to a user’s requests in natural language.
Researchers from Harvard Medical School and Stanford University in the US designed a framework called ‘CRAFT-MD’ to evaluate four LLMs, including GPT-4 and Mistral, on how they perform in settings that closely mimic real interactions with patients.
The framework looked at how well an LLM can gather information about symptoms, medications and family history and then make a diagnosis. The performance of the AI tools was tested on 2,000 clinical descriptions representing conditions common in primary care and across 12 medical specialties.
An AI agent was made to pose as the patient, answering the LLM’s questions in a conversational style, while another AI agent graded the accuracy of the final diagnosis reached by the tool under evaluation, the researchers said.
Human experts then evaluated the results of each patient interaction to determine an LLM’s ability to collect relevant information from the patient, diagnostic accuracy when presented with scattered information, and compliance with instructions.
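To make the setup concrete, the sketch below shows what such a simulated doctor-patient evaluation loop can look like in code. It is not the authors’ CRAFT-MD implementation: the function names (run_consultation, grade_diagnosis), the “Final diagnosis:” convention, and the toy stand-in models are all placeholders for real LLM calls and the study’s actual grading criteria.

```python
# Illustrative sketch of a simulated doctor-patient evaluation loop.
# NOT the authors' CRAFT-MD code; names and stub "models" are placeholders.

from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "doctor" or "patient", "text": ...}


def run_consultation(
    doctor_model: Callable[[List[Message]], str],        # LLM being evaluated
    patient_agent: Callable[[List[Message], str], str],  # AI agent playing the patient
    case_vignette: str,                                   # clinical description hidden from the doctor model
    max_turns: int = 10,
) -> str:
    """Run a back-and-forth exchange and return the doctor model's final diagnosis."""
    history: List[Message] = []
    for _ in range(max_turns):
        doctor_turn = doctor_model(history)
        history.append({"role": "doctor", "text": doctor_turn})
        if doctor_turn.lower().startswith("final diagnosis:"):
            return doctor_turn  # the model has committed to a diagnosis
        patient_turn = patient_agent(history, case_vignette)
        history.append({"role": "patient", "text": patient_turn})
    return "no diagnosis reached"


def grade_diagnosis(diagnosis: str, ground_truth: str) -> bool:
    """Crude grader stub; in the study a separate AI agent judged correctness."""
    return ground_truth.lower() in diagnosis.lower()


# Toy stand-ins so the sketch runs without any model or API access.
def toy_doctor(history: List[Message]) -> str:
    return "Final diagnosis: migraine" if history else "How long have you had the headache?"


def toy_patient(history: List[Message], vignette: str) -> str:
    return "About two days, mostly on one side, with nausea."


if __name__ == "__main__":
    result = run_consultation(toy_doctor, toy_patient, "two-day unilateral headache with nausea")
    print(result, grade_diagnosis(result, "migraine"))
```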
All LLMs were found to show limitations, especially in their ability to reason and conduct clinical conversations based on information provided by patients. This, in turn, compromised the AI tool’s ability to take medical histories and make an appropriate diagnosis, the researchers said.
“Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history taking, and diagnostic accuracy,” the authors wrote.
For example, AI tools often had difficulty asking the right questions to gather relevant patient history, omitted critical information during history taking, and had difficulty synthesizing scattered information, the team said.
These AI tools also performed worse when engaged in back-and-forth exchanges (as most real-world conversations are) than when working from summarised versions of the same conversations, the researchers said.
“Our work reveals a surprising paradox: While these AI models excel on medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” said senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School.
“The dynamic nature of medical conversations—the need to ask the right questions at the right time, gather scattered information, and reason through symptoms—poses unique challenges that go far beyond answering multiple-choice questions.
“When we move from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy,” Rajpurkar said.
The authors recommended that the performance of an LLM in clinical settings should be evaluated by its ability to ask the right questions and extract the most essential information.
They also recommended using open-ended, conversational questions that more accurately reflect real-world doctor-patient interactions when designing, training, and testing AI tools.