Abstract
Purpose :
Variant representation and nomenclature change over time and are inconsistent across the literature, clinical databases, and clinical laboratories, which hinders secondary genetic analyses such as genotype-phenotype association studies. At eyeGENE®, we aim to standardize the variant format in our database by assigning each unique variant an hg19 variant ID in chromosome-position-ref-alt (vcf) format that can be shared with the wider scientific community.
Methods :
We developed a semi-automated variant standardization pipeline to convert eyeGENE® variants from HGVS cDNA format to the vcf format. Multiple tools were employed, including R, Python, VEP, TransVar, InterVar, and VariantValidator. Variants that failed the automatic conversion process were manually reviewed.
Results :
There were 10,137 unique variants in the eyeGENE® database as of 2019. Results were collected from several different clinical testing facilities beginning in 2007. The TransVar tool was found to be the most flexible, as it accepts input in Gene:Transcript:HGVS cDNA, Gene:HGVS cDNA, or Gene:HGVS protein format. VariantValidator is also sufficient for conversion when the transcript ID is known. The automatic pipeline successfully converted 81.9% (8303/10137) of variants. Variants that failed the automatic process included intronic variants in IVS format, Gene:Transcript:HGVS cDNA combinations flagged as incompatible by TransVar, variants without transcript information, variants with inconsistent HGVS nomenclature, and variants with typographical errors. After typos and incorrect HGVS annotations were corrected and other variant information, such as HGVS protein and dbSNP ID, was checked manually, 84.7% of the variants were successfully converted to vcf format following manual review.
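The input formats and failure categories described above suggest a simple triage step before conversion. The following is a minimal sketch of such a step, not the pipeline used in this work: the function name `triage`, the category labels, and the regular expressions are all hypothetical, and the example strings are illustrative inputs rather than records from the eyeGENE® database.

```python
import re

# Hypothetical patterns: real-world variant strings are more varied than this.
TRANSCRIPT_RE = re.compile(r"^(NM|NR|ENST)_?\d+(\.\d+)?$")
IVS_RE = re.compile(r"IVS\d+[+-]\d+", re.IGNORECASE)


def triage(variant: str) -> str:
    """Assign a colon-separated variant string to a rough input category."""
    parts = [p.strip() for p in variant.split(":")]
    desc = parts[-1]  # the variant description is assumed to be the last field

    # Legacy intronic notation such as IVS38-10T>C needs manual rewriting.
    if IVS_RE.search(desc):
        return "ivs_notation"
    # Gene:Transcript:HGVS cDNA
    if len(parts) == 3 and TRANSCRIPT_RE.match(parts[1]) and desc.startswith("c."):
        return "gene_transcript_cdna"
    # Gene:HGVS cDNA (no transcript ID available)
    if len(parts) == 2 and desc.startswith("c."):
        return "gene_cdna_no_transcript"
    # Gene:HGVS protein
    if len(parts) == 2 and desc.startswith("p."):
        return "gene_protein"
    # Anything else (typos, inconsistent nomenclature) goes to a reviewer.
    return "manual_review"


if __name__ == "__main__":
    for s in ["ABCA4:NM_000350.3:c.5882G>A", "ABCA4:c.5882G>A",
              "ABCA4:p.Gly1961Glu", "ABCA4:IVS38-10T>C", "c5882G>A"]:
        print(s, "->", triage(s))
```

Under these assumptions, only the first three categories would be passed to an automatic converter. The remaining strings would be queued for manual review.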
Conclusions :
The conversion of HGVS to vcf format is necessary to develop interoperable datasets for genetic and genotype-phenotype correlation studies; however, it is time consuming and requires multiple tools when performed retrospectively. Missing transcript IDs and inconsistent HGVS annotation are major obstacles in this process. Using the vcf format (CHROM-POS-REF-ALT) facilitates data sharing between clinical labs and reduces the time and effort spent reprocessing genetic data.
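As an illustration of the CHROM-POS-REF-ALT identifier, the sketch below builds an ID string from the four VCF fields. It is a hedged example only: the function name `make_variant_id`, the hyphen-joined layout without a "chr" prefix, and the allele checks are assumptions, since the abstract does not specify these details, and the coordinate in the usage line is a placeholder.

```python
def make_variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build a CHROM-POS-REF-ALT identifier from VCF-style fields."""
    chrom = str(chrom).strip()
    # Assumption: IDs omit the "chr" prefix (e.g. "1", not "chr1").
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]

    ref, alt = ref.strip().upper(), alt.strip().upper()

    if pos < 1:
        raise ValueError("VCF positions are 1-based and must be positive")
    for allele in (ref, alt):
        # Assumption: only plain nucleotide alleles are accepted here.
        if not allele or set(allele) - set("ACGTN"):
            raise ValueError(f"unexpected allele: {allele!r}")

    return f"{chrom}-{pos}-{ref}-{alt}"


if __name__ == "__main__":
    # Placeholder coordinate, not a real eyeGENE variant.
    print(make_variant_id("chr1", 12345, "c", "t"))
```

Because the reference genome build is not encoded in the string itself, an ID of this form is only unambiguous when the build (hg19 in this work) is recorded alongside it.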
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.