Abstract
Purpose:
To assess the potential and capabilities of large language models (LLMs) trained on in-domain ophthalmology data.
Methods:
Training LLMs within medical domains has significantly enhanced their performance, leading to more accurate and reliable question-answering systems essential for supporting clinical decision-making and educating patients. However, few studies have investigated the performance of LLMs in the ophthalmic domain, despite the growing interest in applying LLMs to various medical tasks. This study fine-tunes two LLMs (Mistral and Llama) on a limited corpus of in-domain ophthalmic raw data (journal abstracts, EyeWiki, articles in Wikipedia's ophthalmology category, and textbooks) for a limited number of steps: 4,000 for Mistral and 12,400 for Llama. To overcome resource limitations in fine-tuning, we utilized the QLoRA method, which trains only a small fraction of each model's 7B parameters (9M for Mistral and 12M for Llama). Both LLMs were then compared with OpenAI's GPT-4 model (1.7T parameters) on two distinct test sets: a set of expert-designed ophthalmic questions and a subset of the MedQA dataset curated using ophthalmology keywords. The evaluation results were recorded quantitatively (accuracy) and qualitatively (reviews by an ophthalmologist and by GPT-4).
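The trainable-parameter counts above follow from how LoRA works: each adapted weight matrix gains two small low-rank factors while the base weights stay frozen. The sketch below computes this count; the rank, target modules, and layer count shown are hypothetical illustrations, not the study's actual configuration:

```python
def lora_param_count(target_shapes, r):
    # Each adapted weight W of shape (d_out, d_in) gains two low-rank
    # factors, A (r x d_in) and B (d_out x r), adding r * (d_in + d_out)
    # trainable parameters; the base weights remain frozen.
    return sum(r * (d_in + d_out) for (d_out, d_in) in target_shapes)

# Hypothetical example: adapting q_proj and v_proj (each 4096 x 4096)
# across 32 decoder layers with rank r = 16.
shapes = [(4096, 4096)] * 2 * 32
print(lora_param_count(shapes, r=16))  # 8388608, i.e. ~8.4M parameters
```

With plausible ranks and target modules, such configurations land in the single-digit-millions range, consistent with the 9M/12M figures reported for the two 7B models.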
Results:
In our model assessment, an ophthalmologist highlighted GPT-4's superior performance over Llama, while Llama outperformed Mistral on the expert-designed questions. When GPT-4 was employed to score the models on the same test set for comprehensiveness, correctness, medical terminology usage, and clarity on a scale from 1 to 10, GPT-4 held a marginal advantage, averaging 8.2 versus Llama's 7.825. On the MedQA test set, the fine-tuned models (Mistral: 0.35, Llama: 0.25) exhibited a slight improvement in accuracy over the original models (Mistral: 0.34, Llama: 0.22), whereas GPT-4 achieved the highest accuracy of 0.68.
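The GPT-4-as-judge evaluation can be sketched as a scoring prompt built from the four criteria; the prompt wording and the `build_judge_prompt` helper here are assumptions for illustration, not the study's actual template:

```python
# Criteria taken from the evaluation described above; prompt text is hypothetical.
CRITERIA = ["comprehensiveness", "correctness", "medical terminology usage", "clarity"]

def build_judge_prompt(question, answer):
    # Assemble a single judge prompt asking for a 1-10 score per criterion.
    crit = ", ".join(CRITERIA)
    return (
        f"Rate the following answer to an ophthalmology question on a scale "
        f"from 1 to 10 for each of: {crit}.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )
```

In a GPT-4-as-judge pipeline, the returned string would be sent to the judge model once per model answer, and the per-criterion scores averaged into figures like the 8.2 and 7.825 reported above.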
Conclusions:
Fine-tuning LLMs with the QLoRA method, which tunes only a limited subset of parameters, not only showcased the effectiveness of these models but also underscored their adaptability in resource-constrained scenarios, highlighting their practical utility even with limited training time and resources.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.