Abstract
Purpose:
Large Language Models (LLMs) are increasingly used to generate healthcare education material, but their potential for gender and racial bias remains untested. Because most LLMs are trained on freely available Internet material, we hypothesize that LLM-generated material may reflect racial and/or gender bias inherent in current Internet content. This study aims to analyze whether the race, ethnicity, and gender of the patient prompting ChatGPT affect the length and readability of generated patient education materials about myopia.
Methods:
GPT-3.5 was given a standardized prompt incorporating demographic modifiers (gender, race, and ethnicity) to inquire about myopia: “I am a [race/ethnicity] [gender]. My doctor told me I have myopia. Can you give me more information about that?” The races and ethnicities tested were White, Black, Hispanic, Asian, and Native American. Gender was limited to male or female, and patient age was omitted from the prompts. Each prompt variant was entered five times, each time in a new chat session. Generated responses were collected and analyzed for length (word count) and readability (SMOG Index, Flesch-Kincaid Grade Level, and Flesch Reading Ease). Differences among scores were assessed using two-way ANOVA in SPSS.
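For reference, the three readability indices are computed from sentence, word, and syllable counts; the standard published formulas (not restated in the original abstract) are:

$$\text{Flesch Reading Ease} = 206.835 - 1.015\left(\tfrac{\text{words}}{\text{sentences}}\right) - 84.6\left(\tfrac{\text{syllables}}{\text{words}}\right)$$

$$\text{Flesch-Kincaid Grade Level} = 0.39\left(\tfrac{\text{words}}{\text{sentences}}\right) + 11.8\left(\tfrac{\text{syllables}}{\text{words}}\right) - 15.59$$

$$\text{SMOG Index} = 1.0430\sqrt{\text{polysyllables} \times \tfrac{30}{\text{sentences}}} + 3.1291$$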
Results:
GPT-3.5-generated myopia educational materials (N=50) averaged 296.12 words (SD=35.03) and were significantly shorter for Black patients than for White patients (p=.034). Generated materials had a mean SMOG Index of 13.41 (SD=0.86) and a mean Flesch-Kincaid Grade Level of 10.86 (SD=0.99), with higher scores corresponding to a higher expected reading grade level. The mean Flesch Reading Ease score was 44.28 (SD=4.04), with higher scores denoting easier readability. No significant differences in SMOG Index, Flesch-Kincaid Grade Level, or Flesch Reading Ease scores were detected when the gender or race/ethnicity of the patient in the prompt was modified.
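A minimal sketch of how such an analysis could be reproduced outside SPSS, assuming the Python packages textstat (for the readability indices) and statsmodels (for the two-way ANOVA); the word counts below are simulated placeholders, not study data:

import numpy as np
import pandas as pd
import textstat
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Score one generated response for length and readability.
def score(text: str) -> dict:
    return {
        "word_count": textstat.lexicon_count(text),
        "smog": textstat.smog_index(text),
        "fk_grade": textstat.flesch_kincaid_grade(text),
        "flesch_ease": textstat.flesch_reading_ease(text),
    }

# Placeholder design mirroring the study: 5 races/ethnicities x 2 genders
# x 5 repeats = 50 responses. Word counts are drawn at random from the
# reported mean and SD purely for illustration.
rng = np.random.default_rng(42)
races = ["White", "Black", "Hispanic", "Asian", "Native American"]
genders = ["male", "female"]
df = pd.DataFrame(
    [{"race": r, "gender": g, "word_count": rng.normal(296.12, 35.03)}
     for r in races for g in genders for _ in range(5)]
)

# Two-way ANOVA: main effects of race and gender plus their interaction.
model = ols("word_count ~ C(race) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

In the actual workflow, each of the 50 collected GPT-3.5 responses would be passed through score() and the resulting table analyzed for each metric in turn.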
Conclusions:
Patient demographic information significantly affected the length of the LLM-generated materials but did not affect their readability. It is unclear whether the shorter materials contain the same breadth of information as the longer ones, despite their similar reading levels. Future research should focus on analyzing the accuracy of generated information, as well as the readability of materials generated for other disease processes, to identify potential sources of misinformation.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.