Abstract
Purpose:
Deep learning in biomedical image analysis, including ocular imaging, has been dominated by Convolutional Neural Networks (CNNs), which have led to unprecedented improvements. The recently developed Vision Transformer (ViT) has surpassed CNN models in many application domains and is often considered a generally superior model. The advantage of ViT over CNNs lies in its attention mechanism, which captures long-range dependencies in the image early in the model layers; ViT is also generally more robust than CNNs. In this paper, we evaluate and compare the performance of ViT- and CNN-based models in the classification of fungal and bacterial infectious keratitis.
Methods:
We implemented both ViT- and CNN-based models on the Keras platform. We evaluated two ViT models: the original ViT and the ViT with Multilayer Perceptrons (ViT-MLP Mixer). We evaluated three CNN models: ResNet, EfficientNet, and MobileNet. All of these models were pretrained on the public ImageNet dataset; MobileNet has been reported to perform best on a similar dataset. The dataset contains images from handheld cameras collected from patients with culture-proven corneal ulcers in South India, recruited as part of clinical trials conducted between 2006 and 2015. It comprises 671 images: 440 fungal and 231 bacterial samples. We used 5-fold cross-validation, dividing the dataset into 80% training data and 20% validation data for each fold and training each model 5 times. Model performance was measured by categorical accuracy, area under the curve (AUC), specificity, and sensitivity on the validation dataset.
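The evaluation protocol described above can be sketched in Python. This is an illustrative sketch, not the authors' code: the labels mirror the reported class counts (440 fungal, 231 bacterial), the features and per-fold scores are random stand-ins for a fine-tuned Keras model's predictions, and the `fold_metrics` helper is a hypothetical name introduced here to show how accuracy, sensitivity, specificity, and AUC would be computed per validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(0)
# Stand-in for the 671-image dataset: 440 fungal (label 1), 231 bacterial (label 0).
y = np.array([1] * 440 + [0] * 231)
X = rng.normal(size=(len(y), 16))  # placeholder features, not real images

def fold_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity, and AUC for one validation fold."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "auc": roc_auc_score(y_true, y_score),
    }

# 5 folds, each with an 80% train / 20% validation split, stratified by class.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # In the study, a pretrained model (e.g. MobileNet) would be fine-tuned
    # on the training indices here; we substitute random scores.
    y_score = rng.uniform(size=len(val_idx))
    m = fold_metrics(y[val_idx], y_score)
    print(f"fold {fold}: " + ", ".join(f"{k}={v:.3f}" for k, v in m.items()))
```

Stratified folds keep the fungal/bacterial ratio roughly constant across splits, which matters here because the classes are imbalanced (about 2:1).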
Results:
Among the ViT models, the ViT-MLP Mixer performed better than the original ViT. Among the CNN models, ResNet50 performed best and outperformed a model from a similar study. Interestingly, the ViT-MLP Mixer fell behind the ResNet50 CNN on all evaluation criteria: accuracy, sensitivity, specificity, and area under the ROC curve. The performance discrepancies were significant.
Conclusions:
While Transformer-based models often outperform CNN-based models, this is not always the case. A Transformer model requires more data to train, largely due to the complexity of its embedding layers and attention module. When the dataset is small, a CNN model may be more appropriate, as is the case with our infectious keratitis data.
This abstract was presented at the 2023 ARVO Annual Meeting, held in New Orleans, LA, April 23-27, 2023.