Abstract
Purpose:
Recent advances in large language models (LLMs) have generated significant interest in their application across many domains, including healthcare. However, there are limited data on their safety and performance in real-world scenarios.
This study uses data collected by Dora, a telephone-based conversational agent. Dora asks symptom-based questions to elicit patient concerns and allows patients to ask questions about their postoperative recovery. We utilise real-world postoperative patient questions posed to Dora to examine the safety and appropriateness of responses generated by ChatGPT, a recent and widely used LLM from OpenAI.
Methods:
Sequential patient questions were collected during Dora calls made 3-4 weeks after routine cataract surgery. Calls took place as the standard of care across two UK hospitals. Patients consented to the use of their anonymised data.
A text prompt was designed to give ChatGPT relevant contextual information and instruct it to provide helpful and scientifically grounded answers. Questions, including mistranscriptions, were embedded into this prompt both with and without the addition of symptom data and submitted to ChatGPT (December 15 version).
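To make the prompting procedure concrete, the sketch below shows how a transcribed patient question could be embedded into such a prompt, with or without symptom data, and submitted to a chat model. It is illustrative only: the study used the ChatGPT web interface (December 15 version), and the prompt wording, function names, and API usage shown here are assumptions rather than the authors' implementation.

# Illustrative sketch only; not the study's actual prompt or pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt template: post-cataract-surgery context plus an
# instruction to answer helpfully and in a scientifically grounded way.
PROMPT_TEMPLATE = (
    "You are assisting a patient 3-4 weeks after routine cataract surgery. "
    "Answer the patient's question helpfully and in line with scientific evidence.\n"
    "{symptom_block}"
    "Patient question (verbatim transcription, may contain errors): {question}"
)

def build_prompt(question: str, symptoms: str | None = None) -> str:
    """Embed a transcribed question into the prompt, with or without symptom data."""
    symptom_block = f"Reported symptoms: {symptoms}\n" if symptoms else ""
    return PROMPT_TEMPLATE.format(symptom_block=symptom_block, question=question)

def ask_model(question: str, symptoms: str | None = None) -> str:
    """Submit the assembled prompt to a chat model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; the study used ChatGPT (Dec 15 version)
        messages=[{"role": "user", "content": build_prompt(question, symptoms)}],
    )
    return response.choices[0].message.content

# The same question can then be submitted with and without symptom information,
# mirroring the two conditions compared in the study.
print(ask_model("Is it normal for my eye to still feel gritty?"))
print(ask_model("Is it normal for my eye to still feel gritty?",
                symptoms="mild redness, no pain, vision stable"))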
Each output was assessed for helpfulness, likelihood and extent of harm, clinical appropriateness, evidence of clinical reasoning, and whether the question's intent was addressed. Two ophthalmologists independently labelled each question-answer pair and met to resolve disagreements.
Results:
The dataset comprised 131 unique questions from 120 patients. Most answers were rated as addressing the question's intent. 59.9% of responses were rated 'helpful' and 36.3% 'somewhat helpful'. Although harm was unlikely overall, with 92.7% of responses rated as having a 'low' likelihood of harm, a small number of answers carried the possibility of 'sight loss or severe harm', and 24.4% carried the possibility of 'moderate or mild harm'. 9.5% of answers were opposed to clinical or scientific consensus.
When symptom information was added, the proportion of answers containing inappropriate or incorrect content increased, with no corresponding increase in evidence of clinical reasoning.
Conclusions:
Even with no fine-tuning and minimal prompt tuning, LLMs such as ChatGPT have the potential to helpfully address routine patient queries following cataract surgery. However, today's models have important safety limitations that must be considered.
This abstract was presented at the 2023 ARVO Annual Meeting, held in New Orleans, LA, April 23-27, 2023.