Sequencing data (fastq-files) were uploaded to the Galaxy web platform (usegalaxy.eu),
15 as previously described.
16 Quality control was performed with FastQC (Galaxy version 0.72,
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ last access on 03/01/2021). Reads were mapped to the human reference genome (Gencode, GRCh38.p13, all regions, release 37) with RNA STAR (Galaxy version 2.7.7a)
17 using the Gencode main annotation file (Gencode, release 37). Two BAM files for each sample (one for each lane) were combined in one BAM file per sample using Merge BAM Files (Galaxy version 1.2.0). Reads mapped to the human reference genome were counted with featureCounts (Galaxy version 2.0.1)
18 using the aforementioned annotation file. The outputs of featureCounts were imported to RStudio (version 1.4.1103, R version 4.0.3). Gene symbols were determined based on the ENSEMBL database (release 103, download on March 23, 2021).
19 Genes with zero reads in all samples were removed from the analysis. To identify the top expressed genes in hyalocytes, transcripts per million were calculated based on the output of featureCounts (assigned reads and feature length), as previously described.
20 Expression profiles of all five cell types were adjusted for individual patients by the limma function removeBatchEffect
21 and by consideration within the linear model of DESeq2.
22 After principle component analysis,
22 normalized reads and differential gene expression were calculated using the R package DESeq2 (version 1.30.1) with default parameters (Benjamini-Hochberg adjusted
P values).
22 Transcripts with log2 fold change (log2FC) >2 or <-2 and adjusted
P value < 0.05 were considered as differentially expressed genes (DEG). Heatmaps were created with the R package ComplexHeatmap (version 2.6.2).
23 Gene ontology analysis and its visualization with dotplots and cnetplots was performed using the R package clusterProfiler (version 3.18.1).
24 Other data visualization was done using the ggplot2 package (version 3.3.3).
25 To investigate specificity of expression profiles, known cell type-specific marker genes for different immune cell populations were examined.
26,27 In addition, expression levels of the most specific marker genes, known from single-cell RNA sequencing of human retinal tissue,
27 of potentially contaminating cells, such as photoreceptors, retinal ganglion cells, bipolar cells, and endothelial cells, were studied. To analyze the similarity between transcriptional profiles of hyalocytes and the four other immune cell populations, the expression of each gene in each cell population was calculated as a percentile. The similarity was defined as one – delta percentile and quantified by Pearson correlation coefficient R
2 (with higher values meaning higher similarity between the 2 compared cell populations). To identify overexpressed genes in hyalocytes, upregulated genes in hyalocytes compared to all other cell populations were determined in a first step (log2FC > 2 and
P adjusted < 0.05). In a second step, the 10th and 90th percentile of expression for each gene were calculated for each cell type and only genes, where the 10th percentile of expression in hyalocytes was higher than the 90th percentile of expression in all other cell types were retained. The top ten overexpressed genes in hyalocytes were selected according to the specificity for hyalocytes, which was defined as the difference between percentiles. To validate these markers, their expression was analyzed in four independent and publicly available data sets. For this purpose, the transcriptional profiles of hyalocytes were included, which were recently FACS-isolated from patients undergoing vitrectomy due to macular pucker or macular hole by our group
3,28 (GEO accession: GSE147657). In addition, the expression profiles of human brain microglia
26 (GEO accession: GSE99074, samples with origin Netherlands), human monocytes and monocyte-derived macrophages
29 (GEO accession: GSE58310) were included. To ensure comparability of expression between different samples, all validation and training samples were integrated into one DESeq2 model. Clustering of all samples based on the expression of the top ten overexpressed genes in hyalocytes was performed using t-distributed Stochastic Neighbor Embedding (t-SNE) plots (perplexity = 16, theta = 0.0).
30