Abstract
Purpose:
To investigate whether fractal dimension (FD)–based oculomics could be used for individual risk prediction by evaluating repeatability and robustness.
Methods:
We used two datasets: "Caledonia," healthy adults imaged multiple times in quick succession for research (26 subjects, 39 eyes, 377 color fundus images), and GRAPE, glaucoma patients with baseline and follow-up visits (106 subjects, 196 eyes, 392 images). Mean follow-up time was 18.3 months in GRAPE; thus it provides a pessimistic lower bound because the vasculature could change. FD was computed with DART and AutoMorph. Image quality was assessed with QuickQual, but no images were initially excluded. Pearson, Spearman, and intraclass correlation (ICC) were used for population-level repeatability. For individual-level repeatability, we introduce the measurement noise parameter λ, the within-eye standard deviation (SD) of FD measurements in units of the between-eyes SD.
Results:
In Caledonia, ICC was 0.8153 for DART and 0.5779 for AutoMorph, and Pearson/Spearman correlation (first and last image) was 0.7857/0.7824 for DART and 0.3933/0.6253 for AutoMorph. In GRAPE, Pearson/Spearman correlation (first and next visit) was 0.7479/0.7474 for DART and 0.7109/0.7208 for AutoMorph (all P < 0.0001). Median λ in Caledonia was 3.55% for DART and 12.65% for AutoMorph without exclusions, improving to as low as 1.67% and 6.64%, respectively, with quality-based exclusions. Quality exclusions primarily mitigated large outliers. Worst quality in an eye correlated strongly with λ (Pearson 0.5350–0.7550, depending on dataset and method, all P < 0.0001).
Conclusions:
Repeatability was sufficient for individual-level predictions in heterogeneous populations. DART performed better on all metrics and might be able to detect small, longitudinal changes, highlighting the potential of robust methods.
Retinal color fundus images are low cost, fast to acquire, and noninvasive; yet they provide a detailed picture of the retinal vasculature. Thus color fundus imaging could provide biomarkers for systemic disease,1 a field of study sometimes referred to as oculomics.2 A particularly promising candidate biomarker is retinal fractal dimension (FD), which describes the complexity of the vessel structure. A less complex vasculature could indicate poorer retinal vascular health, and this in turn might correlate with vascular health elsewhere in the body. For instance, lower FD is associated with cardiovascular disease outcomes like myocardial infarction3–5 and has also been studied in relation to neurovascular conditions like dementia.6,7
Those are exciting and promising results, but whether they can be translated into useful tools for clinical practice is still an open question. Effect sizes and increases in predictive performance over baselines using basic, easily available information like age, sex, and smoking status are typically small. Thus, for individual-level predictions, retinal traits like FD would need to have very low measurement noise, yet this has to date been understudied.
Studies also often exclude a large fraction of the available images due to insufficient quality, on the order of 25% to 45%,3,5,8 in datasets like UK Biobank that were specifically collected for research. These exclusions are especially problematic for the clinical applicability of oculomics: if the measurement of the retinal trait of interest (e.g., FD) fails a quarter or half of the time, it is impractical. Furthermore, being older, non-White, or male increases the risk of having poor-quality images,9 and thus these exclusions introduce selection bias. This means that results of existing oculomics research might not apply equally well to everyone, and if we wanted to use FD in clinical practice, the measurement would systematically fail more often for some people (e.g., those of non-White ethnicity).
Thus we set out to investigate whether FD-based oculomics could be used for individual risk prediction by first evaluating FD's repeatability at a population and an individual level, without any image quality exclusions. We use two tools for computing FD: AutoMorph,10 which follows the established paradigm of segmentation, skeletonization, and box counting; and deep approximation of retinal traits (DART),11 which uses a novel paradigm of directly computing FD via a deep learning model that is trained to be more robust to image quality. We then examine how repeatability changes with the level of quality-based image exclusions and look at the relationship between measurement noise and image quality at the level of individual eyes.
We included two datasets in this study: first, the "Caledonia" dataset, which was collected at Glasgow Caledonian University, Glasgow, Scotland, United Kingdom; second, the "Glaucoma Real-world Appraisal Progression Ensemble" (GRAPE) dataset,12 which was collected at the Eye Center of the Second Affiliated Hospital of Zhejiang University, Hangzhou, Zhejiang, China. Both studies had ethical approval and adhered to the Declaration of Helsinki. Participants in both studies provided written informed consent.
Table 1 shows a detailed overview of both datasets.
Table 1. Overview of the Datasets Used, Reporting Statistics for All Subjects We Included
The Caledonia dataset was collected on a Topcon DRI OCT Triton Plus as part of a PhD project looking at choroidal thickness. Thus, the main focus was acquisition of optical coherence tomography (OCT) volume scans, but fortunately color fundus images were acquired at the same time for most scans. Multiple scans were taken on a single day, though in some cases the data collection was repeated due to insufficient OCT quality. Thus five subjects underwent imaging on two different days, three subjects on three days, and one subject on four days. We included every eye with at least five available color fundus images. The subjects were 20 students and six PhD candidates at Glasgow Caledonian University.
The GRAPE dataset was collected on a Topcon TRC-NW8 (108 eyes) and a Canon CR-2 PLUS AF (88 eyes) during clinical practice. The first examination was for suspected glaucoma, with subsequent follow-up visits to monitor progression. Subjects were treated with intraocular pressure (IOP)-lowering drugs after their first visit, and only those with glaucoma were included in the study. We included all eyes that had a baseline and a follow-up color fundus image, taking follow-up images from the first follow-up visit with an available image.
We analyze both datasets to examine FD in 132 subjects, imaged at two different locations with three different devices, covering a large age range and different ethnicities, and including both healthy and glaucomatous eyes. The Caledonia dataset provides relatively ideal conditions for repeatability, namely many images per eye, collected on the same day or a handful of days in a research setting, in young adults who are generally easier to image. However, the color fundus images were not a focus during data collection, so their quality likely varies at least somewhat.
The GRAPE dataset is a longitudinal dataset with only one image per eye per visit and a mean follow-up time of 18.3 months. FD is a measure of retinal vascular complexity and general vascular health, which could conceivably change between visits. Thus, even a perfectly repeatable method would not be expected to produce the same measurement for both visits. Furthermore, data were collected during clinical practice in a population that included individuals over 60 years old, and thus image quality is likely more mixed. This is compounded by the fact that FD is calculated from the vasculature, whereas for glaucoma the optic disc is most important, so images that were sufficient for clinical purposes during collection might be suboptimal for calculating FD.
Based on these considerations, we expect Caledonia to provide a slightly optimistic estimate for repeatability, whereas GRAPE should provide a pessimistic lower bound for repeatability. Taken together, these two datasets allow us to characterize the repeatability of FD well.
We used DART (short for "deep approximation of retinal traits")11 and AutoMorph10 to calculate FD from the color fundus images. AutoMorph is a multistep pipeline consisting of a deep learning model for vessel segmentation followed by skeletonization and the box counting method to compute FD. This is a similar approach to other tools for calculating FD like VAMPIRE.13 Changes to the AutoMorph pipeline (e.g., varying the box sizes used for the FD calculation) might affect its repeatability. Our goal in the present article is not to propose a new algorithm for computing FD or analyze potential modifications that could be made to AutoMorph but simply to use it as provided. We want to analyze the repeatability of AutoMorph as it is released, and this matches what the vast majority of researchers would do in practice, especially those without extensive programming knowledge.
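For readers unfamiliar with box counting, the following is a minimal sketch of the classical procedure on a binary vessel skeleton; the function name and box sizes are our illustrative choices, not AutoMorph's actual implementation:

```python
import numpy as np

def box_counting_fd(skeleton: np.ndarray, box_sizes=(2, 4, 8, 16, 32, 64)) -> float:
    """Estimate fractal dimension of a (non-empty) binary vessel skeleton.

    For a fractal structure, the number of occupied boxes N(s) scales as
    N(s) ~ s^(-FD), so FD is the negated slope of log N(s) vs. log s.
    """
    counts = []
    for s in box_sizes:
        # Trim so the image divides evenly into s x s boxes.
        h, w = (skeleton.shape[0] // s) * s, (skeleton.shape[1] // s) * s
        trimmed = skeleton[:h, :w]
        # Reshape into a grid of s x s boxes; count boxes containing any vessel pixel.
        boxes = trimmed.reshape(h // s, s, w // s, s)
        counts.append(boxes.any(axis=(1, 3)).sum())
    # Fit a line to log(count) vs. log(box size); FD is the negative slope.
    slope, _ = np.polyfit(np.log(box_sizes), np.log(counts), 1)
    return -slope
```

The FD reported by such pipelines is the magnitude of this slope; for a planar curve structure like a vessel skeleton, it lies between 1 and 2.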
DART, on the other hand, uses a single deep learning model to directly output FD from the image. DART's deep learning model was trained to replicate the output of VAMPIRE on UK Biobank images of sufficient quality to apply VAMPIRE, achieving very high internal validity (Pearson correlation of 0.9572 on 14,907 held-out validation images). DART was trained not just to replicate VAMPIRE's output but also to be more robust to image quality. During training, the model received either the original, high-quality image or a poor-quality version of it, obtained by randomly adjusting brightness, contrast, and gamma, simulating imaging problems with anisotropic blur and Gaussian noise, and adding artifacts to the images. Such degradations might ordinarily affect the output of pipelines that compute FD, which is undesirable. However, whether DART received the original image or a degraded version of it, it was tasked with outputting the FD that VAMPIRE calculated from the high-quality image, encouraging it to ignore variations in image quality and thus be more robust.
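As an illustration of this training scheme, here is a minimal sketch of such quality-degrading augmentation; the parameter ranges, the degrade helper, and the 50/50 clean-vs-degraded choice are our assumptions, not DART's actual training code:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def degrade(img: np.ndarray) -> np.ndarray:
    """Randomly degrade a grayscale float image in [0, 1] to mimic poor acquisition.

    Illustrative stand-in for the augmentations described for DART:
    brightness/contrast/gamma jitter, anisotropic blur, and Gaussian noise.
    """
    out = np.clip(img * rng.uniform(0.6, 1.4), 0, 1)               # contrast
    out = np.clip(out + rng.uniform(-0.2, 0.2), 0, 1)              # brightness
    out = out ** rng.uniform(0.7, 1.5)                             # gamma
    out = gaussian_filter(out, sigma=(rng.uniform(0, 2),
                                      rng.uniform(0, 2)))          # anisotropic blur
    out = out + rng.normal(0, rng.uniform(0, 0.05), out.shape)     # Gaussian noise
    return np.clip(out, 0, 1)

# Key idea: the regression target is the FD computed from the ORIGINAL image,
# whether or not the model sees the degraded version:
# x = degrade(img) if rng.random() < 0.5 else img
# y = fd_from_clean_image  # unchanged by the augmentation
```

Because the target never changes with the augmentation, the model is penalized for letting image quality influence its output.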
We chose these methods because both are openly available on GitHub, allowing researchers to access them easily and freely without seeking prior permission. Furthermore, AutoMorph follows the traditional paradigm of segmentation, skeletonization, and box counting, whereas DART uses a novel yet less tried paradigm. For transparency, we want to make the reader aware that two authors of this work (JE and MB) were involved in the development of DART, and thus, despite our best efforts to be neutral and objective, the reader should critically examine the present work.
Previous work comparing retinal traits computed with different tools found poor to moderate interchangeability.14,15 We think that the interchangeability of DART and AutoMorph, while tangential to our main research question, might be of interest to the reader. We conducted this analysis retrospectively, applying the quality exclusion threshold we later recommend based on our results (keeping only images with QuickQual P(bad) < 0.8; see the next section for a description of QuickQual). We used mean FD values per eye to reduce measurement noise and found that, in Caledonia, DART and AutoMorph agreed with a Pearson correlation of 0.6390 and a Spearman correlation of 0.7096 (both P < 0.0001). In GRAPE, they agreed with a Pearson correlation of 0.4418 and a Spearman correlation of 0.4914 (both P < 0.0001). Bland-Altman plots are shown in Supplementary Figure S1. Thus both tools show a level of interchangeability comparable to what previous work reported for other tools.
The metrics above summarize repeatability in a population. However, we are also interested in repeatability at an individual level. Thus we propose the relative SD λ as a metric of individual-level measurement noise, \(\lambda = \frac{\mathrm{SD\ of\ FD\ within\ eye}}{\mathrm{SD\ of\ FD\ across\ eyes}}\). λ expresses how large the variation of FD within an eye is compared to the variation of FD between eyes. As the SD is based on squared deviations from the mean, large errors are weighted more heavily, which we think is desirable in this context. Conceptually, λ is similar to Pearson correlation and ICC, although for λ smaller values are better: a λ of 0 implies no measurement noise, and the larger λ gets, the more noise there is. For convenience, we express λ in %. For the denominator, we use the SD of FD across eyes as estimated from the combined dataset, for the reasons explained in the previous section.
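In code, λ can be computed directly from a long-format table of per-image measurements; a minimal sketch assuming a pandas DataFrame with hypothetical columns eye_id and fd:

```python
import pandas as pd

def lambda_per_eye(df: pd.DataFrame) -> pd.Series:
    """λ per eye: within-eye SD of FD relative to the between-eyes SD, in percent.

    Expects one row per image with columns 'eye_id' and 'fd'.
    """
    # Denominator: between-eyes SD, here taken as the SD of per-eye mean FD
    # (the paper estimates this from the combined dataset).
    between_sd = df.groupby("eye_id")["fd"].mean().std()
    # Numerator: SD of the repeated FD measurements within each eye.
    within_sd = df.groupby("eye_id")["fd"].std()
    return 100 * within_sd / between_sd
```

The median λ reported below would then simply be `lambda_per_eye(df).median()`.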
To examine the relationship between repeatability and image quality, and to evaluate the robustness of the two methods, we first look at how λ changes in Caledonia as we exclude a larger share of images due to image quality. We consider exclusion percentages from 0% to 50%, a range chosen because it covers and slightly exceeds typical values in the oculomics literature. Next, we relate λ to the worst image quality in a given eye. We take the worst rather than the mean quality because a single outlier could lead to a high λ; recall that the SD is based on squared deviations from the mean, so a single large deviation influences the SD more than many small deviations. We compute the Pearson correlation between λ and worst image quality, and further plot them against each other to examine the relationship between the two. This could give some insight into whether there is a critical level of quality below which repeatability decreases quickly. Finally, QuickQual-MEME's quality score is the probability of an image being bad. However, probabilities are constrained quantities, which can be an issue for Pearson correlation. Thus we also evaluate the Pearson correlation between λ and the raw logit value logit(P(bad)) (i.e., the raw output of QuickQual-MEME before the logistic link function is applied). Note that the logistic link function is monotonic, so the Spearman correlation is the same in both cases.
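A minimal sketch of the exclusion sweep just described, assuming the same hypothetical DataFrame plus a p_bad column holding QuickQual-MEME's P(bad) per image:

```python
import pandas as pd

def median_lambda_vs_exclusion(df: pd.DataFrame, steps=range(0, 55, 5)) -> pd.Series:
    """Median λ after excluding the worst x% of images by P(bad), for x in steps."""
    # Fixed denominator: between-eyes SD estimated once from all data,
    # mirroring the paper's use of the combined dataset.
    between_sd = df.groupby("eye_id")["fd"].mean().std()
    medians = {}
    for pct in steps:
        cutoff = df["p_bad"].quantile(1 - pct / 100)  # keep the best (100 - pct)%
        kept = df[df["p_bad"] <= cutoff]
        lam = 100 * kept.groupby("eye_id")["fd"].std() / between_sd
        # Eyes with fewer than two remaining images have no within-eye SD (NaN).
        medians[pct] = lam.dropna().median()
    return pd.Series(medians)
```

The logit comparison then only requires applying scipy.special.logit to the worst P(bad) per eye before computing scipy.stats.pearsonr against λ.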
Figure 3 shows how the distribution of λ changes as images are excluded because of quality. When no images are excluded, the highest λ values for DART and AutoMorph are 81.02% and 199.96%, respectively. These decrease to 16.19% and 37.00% when the worst 5% of images are excluded. The median λ for DART is 3.55% without any exclusions, gradually decreasing to 1.67% as more images are excluded. For AutoMorph, the median is 12.65% without exclusions, decreasing to 6.77% with increasing levels of exclusions.
Interestingly, the minimum λ for AutoMorph was 4.80% without exclusions and still 3.08% with 35% of the images excluded. This contrasts with DART, which had a constant minimum λ of 0.40% even without exclusions. Thus, AutoMorph's best-case λ was 7.5 to 12 times higher than that of DART. AutoMorph's median λ was 3.5 times higher without exclusions and three times higher at best, namely when 40% of the images were excluded. Overall, excluding poor-quality images primarily removes very large outliers, whereas median and best-case repeatability change only slightly.
The Pearson correlation between λ and the worst quality in a given eye was 0.7550 in Caledonia and 0.5350 in GRAPE for DART; for AutoMorph, it was 0.7481 and 0.5606, respectively. If we instead compute the Pearson correlation between λ and logit(worst quality), correlations for DART are 0.8570 and 0.5915 in the two datasets, and for AutoMorph 0.8941 and 0.6082 (all P < 0.0001). Thus the raw logits of QuickQual-MEME's quality score are a better linear predictor of λ than the probability itself.
Figure 4 shows λ against the worst image quality in that eye. Cases of very high λ (>75%) all have very poor image quality (P(bad) > 0.8), and around P(bad) = 0.6, high measurement noise (>25%) appears to become more common. There is a visible difference between Caledonia and GRAPE: in GRAPE, there are cases of high λ even at good image quality, and the correlation between λ and worst quality is lower. This is not unexpected, as the long interval between images in GRAPE means that there could be genuine changes in the retinal vasculature. However, λ is still clearly correlated with worst image quality. Thus, although there might be genuine changes in vasculature in GRAPE, cases of high λ are likely driven by poor image quality.
Both methods showed reasonable to good repeatability at the population level, even without any images being excluded. Interestingly, Pearson and Spearman correlations were comparable between Caledonia and GRAPE, despite GRAPE providing a pessimistic lower bound of performance for the reasons outlined in the Methods section. This is likely due to the low between-eyes SD of FD in Caledonia, as subjects were relatively young and healthy: the between-eyes SD was five times higher in GRAPE for DART FD and 3.7 times higher for AutoMorph FD. For a constant level of absolute measurement noise, a smaller between-eyes SD yields lower correlations.
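This attenuation can be made precise under a simple additive-noise model, which we add here as an illustration rather than an analysis from the paper: if each measurement is the true FD plus independent noise with SD \(\sigma_e\), and the true FD varies between eyes with SD \(\sigma_b\), the expected test-retest Pearson correlation is
\[ r \approx \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}. \]
For example, with \(\sigma_e = 0.2\,\sigma_b\) this gives \(r \approx 0.96\); if \(\sigma_b\) halves while the noise stays the same, \(r\) drops to roughly 0.86.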
DART showed higher repeatability than AutoMorph on all metrics, especially on the Caledonia dataset. On the GRAPE dataset, the two methods performed more similarly. This could be due to the long follow-up time, which means that differences in FD are a combination of genuine vascular changes and measurement noise, making differences in measurement noise between the two methods appear less pronounced.
At the individual level, repeatability in terms of λ was generally good, though there were some large outliers without quality exclusions. These outliers disappeared even with modest levels of image quality exclusions. Repeatability generally improved as more images were excluded, but the improvement primarily affected large outliers.
Similar to the population-level metrics, DART had smaller λs than AutoMorph, both with and without quality exclusions. Interestingly, while robustness to image quality issues was a key motivation for DART's development, DART not only had smaller outliers at low levels of exclusions but also a clear advantage in best-, median-, and worst-case λ at any level of exclusions. Thus DART is also more repeatable on good-quality images.
Based on the values of λ we observed in both datasets, both AutoMorph and DART might be applicable to individual-level risk prediction if we are targeting a population with large variation in FD (i.e., a more general population that is heterogeneous in age and systemic health), and especially if the expected effect on FD is large. The observed values of λ are generally small enough that we would rarely confuse high-, medium-, and low-FD individuals, especially when discarding images with very bad quality (i.e., QuickQual-MEME P(bad) > 0.8).
However, with a median λ of 12.65% without exclusions and 6.64% even with a high level of exclusions (40% of images excluded because of quality), AutoMorph would not be able to detect small changes (e.g., in a cohort with similar age and systemic health, or when looking at longitudinal changes in an individual) and might not be useful for individual-level predictions if the effect on FD is small (e.g., for early-stage disease). DART, on the other hand, might be able to detect such small changes with a median λ of 3.55% without exclusions and 1.67% with exclusions, making it more useful for individual-level predictions and more appropriate for monitoring longitudinal changes.
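To make "detectable change" concrete, one standard yardstick, which we add as an illustration rather than an analysis from the paper, is the Bland-Altman repeatability coefficient: the smallest change distinguishable from measurement noise with approximately 95% confidence,
\[ \mathrm{RC} = 1.96\,\sqrt{2}\,\sigma_{\mathrm{within}} \approx 2.77\,\lambda\,\sigma_{\mathrm{between}}. \]
Under this yardstick, DART's median λ of 1.67% with exclusions corresponds to detecting changes of roughly 5% of the between-eyes SD, whereas AutoMorph's 6.64% corresponds to roughly 18%.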
Generally, these are encouraging results for the applicability of oculomics to individual-level predictions, especially considering that the metrics on the GRAPE dataset provide a pessimistic lower bound because of its longitudinal nature. Although population versus individual level is a common dichotomy in the literature, a more repeatable method is necessarily less noisy, and thus these results are also encouraging for population-level research. DART was more repeatable than AutoMorph even when excluding bad-quality images, which highlights the value of designing robust methods for oculomics and retinal image analysis generally.
A key limitation of this work is the analyzed datasets. The GRAPE dataset is longitudinal and thus only provides a pessimistic lower bound of repeatability. On the other hand, the Caledonia dataset only contained healthy, relatively young adults and thus had low heterogeneity in FD. Additionally, there are endless alternative ways of analyzing the data at hand and further metrics that readers might be interested in.
Future work should examine the repeatability of FD in additional, diverse cohorts. An ideal dataset for this would span a very wide age range, include diverse individuals with heterogeneous systemic health in different healthcare contexts, and have longitudinal data with multiple images per visit, so that measurement noise can be compared to longitudinal changes in the same individuals. Future work should also analyze the repeatability of additional tools such as VAMPIRE,13 SIVA,18 or IVAN,19 as well as additional retinal traits like tortuosity and complexity index.20,21
The authors thank all participants in the studies used in this paper. We especially thank Kai Jin and Juan Ye as well as their colleagues for making the GRAPE dataset openly available to the research community.
JE was supported by the United Kingdom Research and Innovation (grant EP/S02431X/1), UKRI Centre for Doctoral Training in Biomedical AI at the University of Edinburgh, School of Informatics. For the purpose of open access, the author has applied a creative commons attribution (CC BY) license to any author accepted manuscript version arising. M.O.B. gratefully acknowledges funding from: Fondation Leducq Transatlantic Network of Excellence (17 CVD 03); EPSRC grant no. EP/X025705/1; British Heart Foundation and The Alan Turing Institute Cardiovascular Data Science Award (C-10180357); Diabetes UK (20/0006221); Fight for Sight (5137/5138); the SCONe projects funded by Chief Scientist Office, Edinburgh & Lothians Health Foundation, Sight Scotland, the Royal College of Surgeons of Edinburgh, the RS Macdonald Charitable Trust, and Fight For Sight.
Disclosure: J. Engelmann, None; D. Moukaddem, None; L. Gago, None; N. Strang, None; M.O. Bernabeu, None