Abstract
Purpose :
While genome-wide association studies (GWAS) have identified hundreds of genes associated with primary open-angle glaucoma (POAG) risk, such studies are limited in their ability to characterize risk genes that harbor rare variants or have small effect sizes. We hypothesized that unsupervised learning on large protein-protein interaction (PPI) networks could enable comprehensive characterization of the genomic pathways that underlie POAG risk.
Methods :
We first generated a proteome-scale PPI network from high-confidence protein interactions in STRING and used the node2vec algorithm to learn vector representations (embeddings) of all gene products in the network. We identified all genes with known POAG associations in DisGeNET, yielding 294 POAG-associated genes represented in the PPI network. We trained a regularized logistic regression model on the embeddings to learn a continuous POAG association score for each gene and performed Monte Carlo cross-validation to evaluate performance. To characterize the proteome-scale risk landscape, we identified discrete clusters of POAG-associated gene embeddings using k-means clustering. We annotated each cluster using overrepresentation analysis (ORA) with Gene Ontology biological process (GO-BP) terms.
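The scoring and evaluation step can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy, and uses randomly generated vectors as a stand-in for the node2vec embeddings. The embedding dimension, number of splits, test fraction, class weighting, and percentile interval are illustrative assumptions, not details reported in this abstract; only the count of 294 positive genes comes from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in for node2vec embeddings (assumed sizes, not the real network).
n_genes, n_dims, n_positive = 2000, 64, 294
X = rng.normal(size=(n_genes, n_dims))
y = np.zeros(n_genes, dtype=int)
y[:n_positive] = 1
# Shift the positives along a few dimensions so there is signal to learn.
X[y == 1, :8] += 0.5


def make_model(C=1.0):
    # scikit-learn's LogisticRegression applies an L2 penalty by default;
    # the abstract does not state which penalty was used.
    return LogisticRegression(C=C, class_weight="balanced", max_iter=1000)


def monte_carlo_auroc(X, y, n_splits=30, test_size=0.2, seed=0):
    """Monte Carlo cross-validation: repeated random stratified train/test splits."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    aurocs = []
    for train_idx, test_idx in splitter.split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        held_out_scores = model.predict_proba(X[test_idx])[:, 1]
        aurocs.append(roc_auc_score(y[test_idx], held_out_scores))
    return np.asarray(aurocs)


aurocs = monte_carlo_auroc(X, y)
mean_auroc = aurocs.mean()
# Empirical 2.5th and 97.5th percentiles of the per-split AUROCs.
ci_low, ci_high = np.percentile(aurocs, [2.5, 97.5])

# Continuous association score for every gene from a model fit on all labels.
risk_scores = make_model().fit(X, y).predict_proba(X)[:, 1]

print(f"AUROC {mean_auroc:.3f} ({ci_low:.3f}-{ci_high:.3f})")
```

The AUROC printed here reflects only the synthetic data and has no bearing on the reported results. The scores from the final model are what would allow genes without a known association to be ranked.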
Results :
The model generated continuous POAG risk scores for all genes represented in the PPI network. It identified known POAG risk genes with an area under the receiver operating characteristic curve (AUROC) of 0.739 (95% CI 0.686-0.792). Highly scored genes included well-known POAG risk genes such as RHOA, VEGFA, and MMP3, as well as genes with significant contributions to other ocular diseases, such as HSP90AA1 (macular degeneration) and PTGES3 (dry eye disease). K-means clustering identified 5 clusters of gene embeddings. Each cluster comprised a distinct set of functional pathways, with GO-BP enrichment analysis implicating cytokine signaling, coagulation, collagen and extracellular matrix development, and fatty acid metabolism.
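The overrepresentation test behind cluster annotations of this kind is commonly a one-sided hypergeometric test, sketched below using only the Python standard library. The gene identifiers, term names, and Bonferroni correction are hypothetical placeholders chosen for illustration; the abstract does not specify the ORA implementation or the multiple-testing correction.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the annotation."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def ora(cluster_genes, term_to_genes, background):
    """Rank terms by Bonferroni-adjusted overrepresentation p-value."""
    cluster = set(cluster_genes) & background
    raw = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        overlap = len(cluster & annotated)
        p = hypergeom_sf(overlap, len(background), len(annotated), len(cluster))
        raw.append((term, overlap, p))
    n_tests = len(raw)
    adjusted = [(term, overlap, min(1.0, p * n_tests)) for term, overlap, p in raw]
    return sorted(adjusted, key=lambda row: row[2])


# Hypothetical toy data: 100 background genes and one 10-gene cluster.
background = {f"g{i}" for i in range(100)}
cluster = [f"g{i}" for i in range(10)]
term_to_genes = {
    "term_A": [f"g{i}" for i in range(8)] + ["g50", "g51"],  # 8 of 10 in cluster
    "term_B": ["g5"] + [f"g{i}" for i in range(60, 69)],     # 1 of 10 in cluster
}

results = ora(cluster, term_to_genes, background)
for term, overlap, p_adj in results:
    print(term, overlap, p_adj)
```

In this toy setup, `term_A` is strongly overrepresented in the cluster while `term_B` is not. In practice the background would be the set of genes in the PPI network, and each k-means cluster would be tested separately.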
Conclusions :
Unsupervised representation learning on proteome-scale PPI networks offers a means of inferring gene-specific POAG disease risk even for genes with limited experimental evidence. Inferred POAG risk genes fall into clusters spanning distinct functional pathways.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.