Abstract
Purpose :
Data science research depends on large, well-compiled datasets. However, such datasets are difficult to acquire, and there are currently no established best practices for sharing them. An open dataset could address this gap, allowing researchers to investigate new hypotheses and develop more generalizable studies. This abstract describes the factors considered in constructing such a dataset.
Methods :
A dataset from a previously published manuscript (Chen et al, Ophthalmology Science, 2021) was used. It contained medical record numbers (MRNs) of glaucoma patients, providers and their specialty departments, visit identifiers and dates, raw progress notes and medication lists extracted from the electronic health record (EHR), and the statistical analysis. The progress notes and medication lists had previously been manually annotated for medication name, frequency, route, and indication. Each dataset element was reviewed for protected health information (PHI). If PHI was present, a decision was made to remove or de-identify the data field. Example data fields, including those containing PHI, and the rationale for their inclusion or exclusion are described in Table 1.
Results :
Patient MRNs, visit identifiers, and visit dates were the only data fields that explicitly contained PHI. They were de-identified using an R library, anonymizer, which uses hash functions to encode identifying variables. Visit dates were shifted and truncated. Provider data were removed, and department data were included as is. Annotated medication data were paired with all data fields in a CSV file, and the statistical code was included without modification. While medication lists were included as is, progress notes potentially contained PHI and required de-identification using a natural language processing algorithm, Philter (Python), with results verified by a clinician (JSC). These data could be uploaded to online data repositories such as Dryad or Figshare (Table 2), published as a Data Descriptor Article (Zarbin et al, TVST, 2021), and potentially used to develop or validate text-processing algorithms involving medication data.
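The field-level handling described in the Results can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used the R library anonymizer and Philter. It swaps in a keyed SHA-256 hash from the Python standard library, and every name in it (SECRET_SALT, pseudonymize, patient_offset_days, shift_and_truncate, deidentify_row, and the column names) is hypothetical. The size of the date shift and the truncation to year and month are assumptions, since the abstract does not state them. Free-text progress notes are left out because their de-identification requires a dedicated tool and clinician review.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret key; in practice it would be random and never published.
SECRET_SALT = b"replace-with-a-private-random-value"


def pseudonymize(identifier: str, salt: bytes = SECRET_SALT) -> str:
    """Map an identifier to a stable pseudonym with keyed SHA-256 (HMAC)."""
    digest = hmac.new(salt, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def patient_offset_days(mrn: str, max_days: int = 365,
                        salt: bytes = SECRET_SALT) -> int:
    """Derive a per-patient shift of 1..max_days days (assumed range)."""
    digest = hmac.new(salt, b"shift:" + mrn.encode("utf-8"),
                      hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_and_truncate(visit: date, offset_days: int) -> str:
    """Shift a date back by offset_days, then keep only year and month."""
    shifted = visit - timedelta(days=offset_days)
    return shifted.strftime("%Y-%m")


def deidentify_row(row: dict) -> dict:
    """Apply the field-level decisions to one visit record."""
    offset = patient_offset_days(row["mrn"])
    return {
        "patient_id": pseudonymize(row["mrn"]),
        "visit_id": pseudonymize(row["visit_id"]),
        "visit_month": shift_and_truncate(row["visit_date"], offset),
        "department": row["department"],            # included as is
        "medication_list": row["medication_list"],  # included as is
        # provider is removed; progress notes are handled separately
    }


example = {
    "mrn": "0001234",
    "visit_id": "V-42",
    "visit_date": date(2019, 3, 15),
    "provider": "Dr. Example",
    "department": "Ophthalmology",
    "medication_list": "latanoprost 0.005% qhs OU",
}
clean = deidentify_row(example)
```

Using one shift per patient keeps the intervals between that patient's visits intact before truncation, which matters for longitudinal analyses. The key must stay private, because hashing a small identifier space such as MRNs without a secret can be reversed by enumeration.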
Conclusions :
Processing and uploading datasets for open publication is feasible and inexpensive, and could become standard practice to increase collaboration and dataset accessibility in vision research.
This abstract was presented at the 2022 ARVO Annual Meeting, held in Denver, CO, May 1-4, 2022, and virtually.