Abstract
Purpose:
With the decreasing cost of genomic sequencing, data-driven analysis has become an increasingly important aspect of molecular diagnosis. Larger data sources have vastly improved post-sequencing processes such as annotation and filtering, yet the ultimate determination of molecular causality is still left to experts in the field. Our goal was therefore to design an algorithm whereby a list of patient variants could be scored and ranked according to likelihood of disease causality, using information gleaned from large stores of genomic data. Such a system would need to learn from ever-growing datasets, continuously evolving and optimizing.
Methods:
We developed a statistical framework that integrates gene-disease association, inheritance pattern, and functional prediction to rank mutant genes in a patient. Using previously published mutant alleles and our internal database, we estimated the prior probability that each gene is associated with a specific disease. This study focused on common retinal diseases, for which many genes are associated with multiple diseases. The algorithm scores genes as likely to be causative based on known disease associations and their ability to fit a given inheritance pattern. Functional prediction scores were integrated as a measure of variant potency, with normalization based on gene-level analysis of common variants.
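As a rough illustration of how the three components described above might be combined, the following minimal Python sketch multiplies a gene-disease prior, an inheritance-fit indicator, and a normalized functional score. It is not the authors' implementation: the multiplicative combination, the all-or-nothing inheritance check, the division by a per-gene common-variant baseline, and every name (`Variant`, `fits_inheritance`, `rank_genes`, `GENE_A`, and so on) and number are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    zygosity: str            # "het" or "hom"
    functional_score: float  # hypothetical predictor output in [0, 1]


def fits_inheritance(variants, mode):
    """Return True if a gene's variants could satisfy the inheritance mode."""
    if mode == "dominant":
        return len(variants) >= 1
    if mode == "recessive":
        # One homozygous variant, or at least two variants in the gene
        # (a simplification: phase is not checked here).
        return any(v.zygosity == "hom" for v in variants) or len(variants) >= 2
    raise ValueError(f"unknown inheritance mode: {mode}")


def rank_genes(variants, priors, modes, baseline):
    """Score each gene as prior * inheritance fit * normalized potency.

    priors:   assumed P(gene | disease), e.g. estimated from known cases
    modes:    assumed inheritance mode per gene
    baseline: assumed typical functional score of common variants per gene
    """
    by_gene = {}
    for v in variants:
        by_gene.setdefault(v.gene, []).append(v)

    scores = {}
    for gene, gene_variants in by_gene.items():
        prior = priors.get(gene, 0.0)
        fit = 1.0 if fits_inheritance(gene_variants, modes.get(gene, "dominant")) else 0.0
        # Normalize the strongest variant against the gene's common-variant baseline.
        potency = max(v.functional_score for v in gene_variants) / baseline.get(gene, 1.0)
        scores[gene] = prior * fit * potency

    # Highest-scoring (most likely causative) genes first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder inputs; gene names and values are invented for the example.
patient = [
    Variant("GENE_A", "het", 0.9),
    Variant("GENE_A", "het", 0.8),
    Variant("GENE_B", "het", 0.9),
    Variant("GENE_C", "hom", 0.5),
]
priors = {"GENE_A": 0.3, "GENE_B": 0.2, "GENE_C": 0.05}
modes = {"GENE_A": "recessive", "GENE_B": "recessive", "GENE_C": "recessive"}
baseline = {"GENE_A": 0.5, "GENE_B": 0.5, "GENE_C": 0.5}

for gene, score in rank_genes(patient, priors, modes, baseline):
    print(gene, round(score, 3))
```

In this toy case the single heterozygous variant in `GENE_B` cannot satisfy a recessive model, so that gene drops to the bottom of the ranking despite its high functional score. A real system would likely use a graded rather than binary inheritance fit and re-estimate the priors as new cases accumulate.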
Results:
The algorithm was trained with gene-disease association data from more than 1000 patients with various retinal diseases. We tested the algorithm on a separate cohort of 30 patients and compared the results to molecular diagnoses determined by human experts. Strikingly, our results showed over 85% correlation between the algorithm and human analysis. Of 20 high-confidence human calls, 18 algorithmic calls were correlated. The algorithm was similarly unable to call 3 of 4 samples with no human determination, while identifying a possibly overlooked causative gene, RP1L1, in the fourth.
Conclusions:
RES represents a significant step toward streamlining NGS-based molecular diagnosis, as it prioritizes the most likely disease-causing genes through quantification, allowing researchers and diagnosticians alike to work efficiently and effectively. More importantly, this tool will continue to improve with the accumulation of larger datasets. Similar methods can be implemented for other human diseases as well.