Artificial intelligence has rapidly evolved from the experimental phase to the implementation phase in many image-driven clinical disciplines, including ophthalmology. A combination of the increasing availability of large datasets and computing power with revolutionary progress in deep learning has created unprecedented opportunities for major breakthrough improvements in the performance and accuracy of automated diagnoses that primarily focus on image recognition and feature detection. Such an automated disease classification would significantly improve the accessibility, efficiency, and cost-effectiveness of eye care systems where it is less dependent on human input, potentially enabling diagnosis to be cheaper, quicker, and more consistent. Although this technology will have a profound impact on clinical flow and practice patterns sooner or later, translating such a technology into clinical practice is challenging and requires similar levels of accountability and effectiveness as any new medication or medical device due to the potential problems of bias, and ethical, medical, and legal issues that might arise. The objective of this review is to summarize the opportunities and challenges of this transition and to facilitate the integration of artificial intelligence (AI) into routine clinical practice based on our best understanding and experience in this area.
ARTIFICIAL INTELLIGENCE, MACHINE LEARNING, AND DEEP LEARNING
Artificial intelligence (AI) was first proposed in print in a Dartmouth Summer Research Project in 1955.1 AI is a broad term referring to a branch of computer science that is hypothetically committed to developing computer algorithms for the tasks that have traditionally been accomplished by human intelligence, such as the ability to learn and solve problems. Machine learning (ML) is a division of AI that provides knowledge in the form of data to computers, along with observations to optimize the goodness of fit between input—including text, image, or video data—and output as a classification. A conceptual and engineering breakthrough by pioneers of the field, Yoshua Bengio, Geoffrey Hinton, and Yann LeCun enabled the development of artificial neural networks and deep learning (DL) to become a subfield of ML. This technology requires multiple processing layers to learn and detect features ranging from simple ones such as lines, edges, textures, and intensity, to complex features like shapes, lesions, and a whole image in a hierarchical structure.
Neural networks, inspired by the neurons of the brain, underlie the algorithms most commonly used for image analysis today. These networks are composed of a number of layers of connected nodes, where each node receives information from other nodes and sends a signal on to other groups of nodes. The goal of the overall network is to find an answer that matches a defined ground-truth label by adjusting the pattern and weights of node connections over thousands or millions of attempts until the best match to the ground truth is achieved. Many types of neural structures have been proposed, representing various ways to cluster those nodes. The most common type of neural network used for image recognition is a convolutional neural network (CNN).2
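The weight-adjustment process described above can be illustrated with a minimal sketch of a single node trained by gradient descent. This is a toy under stated assumptions, not a CNN: the two "image features" per sample, the labels, the learning rate, and the function names are all hypothetical and chosen only to make the repeated weight updates visible.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_node(samples, labels, epochs=2000, lr=0.5):
    """Fit one node (weights plus bias) by many small weight updates."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # gradient of the log loss w.r.t. the weighted sum
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (abnormal) or 0 (normal) for one feature vector."""
    prob = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if prob >= 0.5 else 0


# Made-up data: two hypothetical features per image, label 1 = abnormal.
samples = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.7)]
labels = [0, 0, 1, 1]

weights, bias = train_node(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

A real CNN stacks many such nodes in layers and shares weights across image positions, but the principle of nudging weights toward the ground-truth label is the same.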
The “training” of neural networks is conducted either by supervised learning, where a training set of data with annotations by humans to match the disease outcome are used, or unsupervised learning, in which the training data do not have annotations and where the algorithm strives to cluster or organize to “understand” the underlying patterns. The majority of ML systems to date in ophthalmology are developed using supervised learning,2,3 wherein the CNN analyzes pixel data from a large number of manually labeled images to determine a specific classification of disease type and severity.
CURRENT STATUS OF AI DEVELOPMENT IN OPHTHALMOLOGY
Ophthalmology is a branch of medicine that deals with the diagnosis and treatment of eye diseases. A number of imaging modalities have been used for the diagnosis of eye diseases; however, the interpretation of these images is highly dependent on the skill and experience of physicians, and this process is often subjective, with substantial interobserver variation.4–6 Evidence supporting this includes a study wherein even senior glaucoma specialists could only achieve a “substantial” level of agreement (κ = 0.63) for the classification of glaucoma based on optic disc photography. This agreement could be even poorer among general ophthalmologists (κ = 0.51) and trainees (κ = 0.50).6 Similarly, optometrists achieved sensitivity of 67% and specificity of 84% with diagnosing diabetic retinopathy (DR), which is below the recommended screening standard.7
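As a reading aid for the κ values quoted above, the following minimal sketch computes Cohen's κ for 2 graders. It assumes a simple 2-rater setting; the cited study may have used a different κ variant, and the grades shown are invented for illustration.

```python
from collections import Counter


def cohen_kappa(grader_a, grader_b):
    """Chance-corrected agreement between two graders on the same images."""
    if len(grader_a) != len(grader_b) or not grader_a:
        raise ValueError("both graders must label the same non-empty image set")
    n = len(grader_a)
    # Proportion of images on which the two graders agree.
    observed = sum(a == b for a, b in zip(grader_a, grader_b)) / n
    # Agreement expected by chance, from each grader's label frequencies.
    counts_a = Counter(grader_a)
    counts_b = Counter(grader_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented grades for 10 optic disc photographs (1 = glaucoma, 0 = normal).
grader_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
grader_b = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
kappa = cohen_kappa(grader_a, grader_b)
```

In this invented example the graders agree on 8 of 10 images, but because half of that agreement would be expected by chance, κ is about 0.6.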
In 2016, Abràmoff et al8 and Gulshan et al9 published the first 2 articles on DL technology in the field of ophthalmology, both with the task of detecting DR from retinal fundus photographs. Both studies reported an area under the curve (AUC) of 0.98 to 0.99, a level of accuracy far better than any previous report based on traditional pattern recognition of lesions. The power of DL, for the first time, captured the imagination of the whole field of ophthalmology. Since then, a number of research groups have developed their own algorithms based on similar CNNs and published results on multiple eye diseases, including DR,10–14 glaucoma,15–17 age-related macular degeneration (AMD),18–21 and retinopathy of prematurity,22–25 based on a variety of imaging modalities, including fundus photography, optical coherence tomography (OCT),26–31 visual field,32–34 and many others.35,36 Despite this variation, DL algorithm (DLA) applications primarily focus on 3 major tasks: classification, segmentation, and prediction using static images collected from medical devices.
The most common DLA application is to generate a global classification from a specific image, either with or without disease or on a specific disease severity scale. Some review articles have been published to summarize the performance of these DLAs with a variety of diseases,37–42 but several key points must be highlighted.
First, almost all the studies reported robust accuracy, described as AUC, sensitivity, and specificity, which is far better than what has been reported for human image graders and machine learning tools based on traditional pattern recognition algorithms. For example, in the classification of referable DR, DLA achieves an AUC of 0.98 to 0.99 with a sensitivity of 0.97 (ranging from 0.89 to 0.99) and specificity of 0.96 (ranging from 0.98 to 0.99). This performance is comparable with, if not better than, that of trained human graders where unanimous consensus grading by specialist experts was set as the benchmark.43
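For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity, and AUC can be derived from algorithm output scores and reference labels. The labels, scores, and 0.5 threshold are invented for illustration, and AUC is computed here with the rank-based (Mann-Whitney) formulation, which is one of several equivalent approaches.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Probability that a random diseased image scores above a random normal one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 4 images with referable DR (1) and 4 without (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
area = auc(labels, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes performance across all thresholds, which is why studies typically report all 3.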
Second, almost all the studies used CNNs that are publicly available. Google Inception V3 is the most commonly used of these, followed by VGG-net, AlexNet, ResNet, and so on.9,10,13,27,44,45 The adoption of a neural network model often depends on its availability when the study is conducted. Several studies have used pretrained CNNs and transfer learning, achieving similar accuracy with relatively smaller sample sizes.17,26,46–48 In 1 study, 6 CNN models were simultaneously assessed on AMD classification; AlexNet yielded a better performance but the difference in performance among the networks was minimal.18
Third, CNN image classification is not used only with 2-dimensional (2-D) images but also with 3-dimensional (3-D) images such as OCT. A few studies have reported on the performance of OCT B-scans where 2-D data were used.27,28,31,49,50 Another recently published study developed and reported on a 3-D DLA of an OCT image to classify glaucomatous optic neuropathy when 3-D parapapillary retinal nerve fiber layer (RNFL) data were used.51
Fourth, DL is not used only for image classification but also for lesion detection and image segmentation. This task is more complex than image classification, consisting of creating a boundary definition around the objects in an image and classifying each of them. U-net is the most commonly used algorithm for this task, and it has proved to be accurate in the detection of exudates, hemorrhage, and optic discs in fundus photographs. It has also been accurate with segmenting OCT structures and detecting intraretinal fluid and subretinal fluid, and OCT pathologies such as neovascularization, macular edema, drusen, geographic atrophy, epiretinal membrane, vitreous traction, and macular holes.52–56
Fifth, in image classification tasks, DL makes classifications based on the global image instead of generating a classification after the detection of a specific lesion. This raises “black box” concerns, because the classification is based on a one-size-fits-all neural network architecture that is not specific to the disease. Heatmaps have become a popular method to highlight the pixel regions that contribute most to the DL classification. DL studies increasingly publish their heatmap results, and the highlighted regions frequently do not match the features that clinicians commonly use to differentiate disease, suggesting that the CNN may “see” things differently from humans.21,27,51,57,58 In fact, some studies have synthesized images through very small perturbations of the pixels, which can easily fool the CNN into producing a completely inaccurate output.59,60
Sixth, DL may go beyond simple classification of an image; it can also predict the prognosis or outcomes of a treatment when progression data are used as ground truth for training the algorithm. DL has demonstrated that image data alone, without referring to other known risk factors, is able to achieve reasonably good performance in predicting the prognosis for DR, AMD progression, and structural and functional progression in glaucoma, although most of these DLAs have yet to be independently validated.19,61–63
Seventh, DL is able to “see” the features that are not differentiable by humans in certain classification tasks, such as cardiovascular disease risk factors and smoking status. Poplin et al from Google demonstrated that a DLA trained with UK Biobank and US EyePACS datasets was able to classify age, current smoking status, blood pressure, body mass index, and even 5-year myocardial infarction with reasonably good accuracy among independent datasets.64 Their DLA was unexpectedly validated in a small group of 239 patients selected from a randomly selected Asian database during the publication peer review process.42 This finding is intriguing because it was able to prove that the features of the retina, as a biologically relevant end-organ for the vascular and neural systems, could be used to classify cardiovascular disease risks when other known risk factors were not included.
Despite all the fascinating advancements in AI technology, translating and deploying AI technology into real-world ophthalmology practice remains challenging.
CHALLENGES OF AI ALGORITHMS IN OPHTHALMOLOGY
Challenge #1: Adequate Quantity of Training Data
A useful training dataset should exhaust all possible variations of disease phenotypes, including but not limited to the variations of disease severity, ethnicity, artifacts, types of fundus camera, and confounding of coexisting diseases. In this scenario, the clinical characteristics of the training set should be clearly delineated. For instance, an algorithm that was trained based on a dataset from a screening setting might not be appropriately used in hospital clinical settings, where disease severity is substantially different. An algorithm developed based on subjectively defined glaucoma without referring to the visual field or relevant clinical diagnostic data may not be appropriate for real-world deployment as an end product for glaucoma diagnosis, as the benchmark used in the training process is different from the purpose of deployment.
This dependency on large amounts of data for accurate algorithm development has become an impediment to the adoption of AI in clinical practice. Hospitals may have a large amount of data but lack good access to computer science and AI experts. The data in hospitals are often not well organized for meaningful data mining, and other obstacles such as regulatory considerations, privacy protection, ethical issues, and legal concerns may further hinder data sharing. Conversely, computer science companies have the computing power and AI expertise, but they do not have access to clinical datasets. Although the “Big Nine” (Google, Alibaba, Amazon, Tencent, Apple, Baidu, Facebook, IBM, and Microsoft) have invested tremendous resources in AI development, that does not mean they have had sufficient access to clinical data. It would be tremendously helpful to create freely available disease-specific or device-specific shared data resources for computer experts to use in testing different algorithm designs. Good examples of successful data sharing include the publicly available ImageNet dataset, which has been used to generate many breakthroughs in image recognition; the public datasets for DR organized by Kaggle, which have been used by many developers of DR algorithms; and the UK Biobank’s open-source data for eye disease classification and prediction model development. Nevertheless, alternative learning methods have recently been proposed that can simulate how the human brain works and learn from fewer examples. Using generative adversarial networks, one group of researchers65–68 created or synthesized a large number of diverse and random computed tomography and magnetic resonance imaging images from scratch and claimed that the set of images could be used as a training set for future CNN development. These efforts, however, have yet to be proven and have not achieved great success to date.69
Challenge #2: Appropriate Definition of Ground Truth and Labeling
A majority of algorithms are developed based on images retrieved directly from medical devices, and then a number of experts are asked to subjectively grade (or label) these images. The ground truth is determined by the unanimity of the graders along a continuum, such as normal, probably normal, indeterminate, probably abnormal, or definitely abnormal; or by simply classifying the images into normal or abnormal. This approach is often subject to significant misclassification, errors, and insufficiency because expert classification is subjective, and there is significant interobserver variation.70,71 The label from subjective classification does not necessarily contain clinically important information such as the likelihood of progression, potential treatments, responses to treatment, and so on. An ideal ground truth should be based on criterion standard definitions and retrieved from real-world clinical data such as pathology reports and electronic medical records, which often depend on multiple modalities of imaging devices and clinical procedure details that comply with diagnostic standards.
Challenge #3: Assessment of the Accuracy of AI Algorithms
Critical assessment of the accuracy of an AI algorithm should be based on widely recognized principles of evidence-based medicine, just as for other new medicines or devices. A preprint released via an online repository such as arXiv.org, a peer-reviewed article describing technical developments in a computer science journal, or even a peer-reviewed article reporting diagnostic performance in a clinical journal does not necessarily establish the accuracy of a technology or justify its adoption in real-world clinical practice. In this context, an AI algorithm developed for diagnosis or classification should comply with the Standards for Reporting Diagnostic Accuracy (STARD) statement, and an AI algorithm developed for prediction should follow the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement for transparent reporting of study settings, study population, definitions and measurements of outcomes, time and interval of follow-up, and so on, as AI algorithms are sensitive to the target population, equipment used, imaging protocols, and referral standards.72,73 A number of initiatives to develop specific guidelines for AI-based clinical trials are currently under deliberation.74–76 These include the Consolidated Standards of Reporting Trials (CONSORT)-AI, Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT)-AI, and TRIPOD-ML statements, which are extensions of existing guidelines that provide clinical trial protocol and reporting standards.
Overfitting is the most common bias in AI algorithm development, where the algorithm is overfitted to the training set. This can be driven by a mistake in training set development, for instance, using one type of fundus camera to collect images for the disease and another type of fundus camera for the normal group, so that the algorithm essentially ends up classifying the type of fundus camera instead of disease and nondisease. Reliable external validation of the data collected in newly recruited patients or at different sites, or using different models of device as an independent study, is the best way to mitigate overfitting problems. There are, in general, 2 approaches to external validation of algorithms noted in published studies—either a publicly available image dataset or a prospective study. In ophthalmology, Abràmoff et al,8 Gargeya and Leng,58 and Gulshan et al9 used the Messidor-2 dataset, E-Ophtha databases, or EyePACS-1 as a reference for external validation. However, an ideal validation should be based on images that have never been used in the training set, and a validation process conducted by researchers is often not able to ensure this is the case. Therefore, an ideal validation for an image dataset should be done on larger standardized datasets provided by an independent party using a setup similar to Kaggle’s public image recognition challenges, where the ground truth has not been disclosed, so that a neutral objective benchmark can be established.
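The camera-confounding mistake described above can be screened for before training with a simple cross-tabulation. The sketch below is a minimal illustration under assumed conventions: the record fields `camera` and `label`, the camera names, and the function names are hypothetical, and a real audit would also examine site, operator, and image-format confounders.

```python
from collections import defaultdict


def camera_label_table(records):
    """Count images per (camera, label) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for record in records:
        table[record["camera"]][record["label"]] += 1
    return {camera: dict(counts) for camera, counts in table.items()}


def confounded_cameras(records):
    """Cameras whose images all carry one label, so camera predicts the label."""
    table = camera_label_table(records)
    return sorted(camera for camera, counts in table.items() if len(counts) == 1)


# Hypothetical training set: each camera contributes only one class.
biased_set = [
    {"camera": "camera_A", "label": "disease"},
    {"camera": "camera_A", "label": "disease"},
    {"camera": "camera_B", "label": "normal"},
    {"camera": "camera_B", "label": "normal"},
]

# Hypothetical training set: both classes are imaged on both cameras.
mixed_set = [
    {"camera": "camera_A", "label": "disease"},
    {"camera": "camera_A", "label": "normal"},
    {"camera": "camera_B", "label": "disease"},
    {"camera": "camera_B", "label": "normal"},
]

flagged_biased = confounded_cameras(biased_set)
flagged_mixed = confounded_cameras(mixed_set)
```

A flagged camera does not prove that the algorithm has learned the device rather than the disease, but it identifies a dataset on which internal accuracy figures cannot be trusted without external validation.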
Similar to the assessment of other new medicines or devices, the assessment of AI algorithms should ideally be carried out in a prospective clinical trial. Recently, Lin et al77 conducted a clinical trial to compare an AI-assisted cataract classification technique versus ophthalmologists in real-world settings, and found the AI technique was less accurate in diagnosing childhood cataracts than ophthalmologists but was more efficient. There have been 25 items registered under eye diseases on the clinical trials website (https://clinicaltrials.gov/) using terms such as “convolutional neural network,” or “deep learning,” or “machine learning,” or “artificial intelligence,” and “ophthalmology,” with most focused on DR, glaucoma, cataracts, visual acuity assessment, and so on.
Challenge #4: Mode of Care of AI Adoption
Currently available AI algorithms in ophthalmology fundamentally enable tasks of classification, detection, or segmentation. Classification refers to assigning an entire image or lesion in a particular image to a category, for example, the DR severity scale. Detection is done to identify a specific abnormality within an image, for example, choroidal neovascular lesions in an OCT image. Segmentation is done to identify or isolate a specific structure of interest in an image, such as isolating the RNFL in an OCT image.
The outcomes of interest in AI classification can be multiple and varied. The most common outcome of interest is separating abnormal from normal patients. This may appear to be a very simple day-to-day task for ophthalmologists, but it can be challenging for noneye professionals; an endocrinologist, for example, may struggle to classify an image as with or without referable DR, even though this may be part of their professional training for the management of diabetic complications. The second common outcome of interest is to assign images to severity categories or grading schemes, such as classifying an image on the DR severity scale. This classification task is often more difficult and less accurate than a dichotomous classification because differentiating between certain grades can be challenging when the difference is minimal, for instance, differentiating normal from mild DR where the only difference is the presence of a microaneurysm. The third outcome of interest is to predict the prognosis of a disease, such as differentiating progressive glaucoma from stable glaucoma.
Challenge #5: Integrate AI Into Clinical Pathways
Bossuyt et al78 proposed 3 clinical pathways (triage, replacement, and add-on) to integrate new diagnostic tests into existing clinical pathways that would be appropriate for AI deployment.
In the triage model, AI algorithms can be used as a triage tool for opportunistic screening in noneye clinical settings or as a tool for assisting integration into pathways of grading in reading centers. In Australia, our research team has installed an AI system in endocrinology and primary care settings to enable opportunistic screening and targeted referral so that DR patients are referred to eye professionals (Fig. 1), whereas the normal ones stay for routine glucose management. In this research deployment model, the AI system was used to generate a report that would be reviewed by a qualified physician such as an endocrinologist who would sign off on the diagnosis. In this case, the AI is considered a decision support tool for diagnosis. In China, our research team works with a local software company to provide technical support for a nationwide DR screening program where the AI is used as the first triage tool for identifying “super-normal” cases, defined as classifications of normal and gradable in all 4 images collected for an individual (Fig. 2). In the case of super-normal, a report from AI is generated and sent to the patient at their point-of-care. A pilot study proved that this model could reduce the workload for telemedicine grading centers by >50%.
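The “super-normal” rule described above can be written as a short decision function. This is a minimal sketch based only on the definition given in the text (normal and gradable in all 4 images); the class, field names, and routing labels are hypothetical, and the routing of non-super-normal cases to a grading center is our assumption about the workflow rather than a description of the deployed software.

```python
from dataclasses import dataclass


@dataclass
class ImageResult:
    gradable: bool        # whether the AI judged image quality sufficient
    classification: str   # "normal" or any other grade returned by the AI


def is_super_normal(images, required_images=4):
    """True only if every one of the required images is gradable and normal."""
    return len(images) == required_images and all(
        img.gradable and img.classification == "normal" for img in images
    )


def triage(images):
    """Route a patient: AI report at point of care, or human grading."""
    if is_super_normal(images):
        return "ai_report_to_patient"
    return "send_to_grading_center"  # assumed fallback route


all_normal = [ImageResult(True, "normal") for _ in range(4)]
one_ungradable = [ImageResult(True, "normal")] * 3 + [ImageResult(False, "normal")]
one_abnormal = [ImageResult(True, "normal")] * 3 + [ImageResult(True, "referable DR")]

routes = [triage(all_normal), triage(one_ungradable), triage(one_abnormal)]
```

The rule is deliberately conservative: a single ungradable or non-normal image is enough to send the case to human graders, which is what allows the AI to remove only the clearest normal cases from the grading workload.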
In a replacement model, the AI algorithm is used to replace the clinical diagnosis task of a clinician because the AI can be more accurate, rapid, and reproducible and less dependent on access to a clinician. This model of care is feasible only for tasks where AI is definitively superior to physicians (such as estimation of bone age in radiology) or for tasks that are simple enough to carry out, such as classifying the image as that of a right eye or left eye, and it often requires strict regulatory approval.
In an add-on model of care, AI is used as a procedure in parallel with or after diagnosis by clinicians. This model is often used when the task involves time-consuming and repetitive work, such as counting lesions (eg, microaneurysms in DR grading) or automatically measuring RNFL thickness, which would otherwise involve extensive manual segmentation of a large number of OCT B-scans (Fig. 3).
Challenge #6: AI Clinical Adoption Is Beyond Clinical Consideration
Successful AI deployment in clinical practice requires the active involvement of all stakeholders, including patients, ophthalmologists, imaging technicians, hospital administrations, regulatory bodies, and industry. It is important to ensure that all stakeholders will benefit from AI deployment and are willing to collaboratively facilitate the development of best practices covering ethics, patient consent, privacy protection, data ownership and sharing, integration with existing electronic medical systems, and user-friendly software interfaces for targeted clinical settings.
In 2016, the 21st Century Cures Act was signed into law, aiming to accelerate the discovery, development, and delivery of new cures and treatments. The US Food and Drug Administration’s (FDA’s) current strategic policy emphasizes leveraging innovation, including digital health technologies, and is highlighted by new software as a medical device (SaMD) and digital health regulatory approval pathways for AI and computer vision algorithms.79,80 To date, the FDA has approved several AI-based SaMDs, including the IDx ML-based software system for automated detection of DR and an AI product for use with computed tomography scans for indicators associated with stroke.81,82 Most of these approvals are for algorithms that are locked in before going to market, but the FDA is currently also considering SaMD regulatory frameworks for continuous learning and adaptive algorithms that are potentially able to be adapted to improve performance in real time.
Challenge #7: Security Issues in AI Deployment
Recently, various attack methods have proved effective against existing DL models. For example, some studies found that by adding adversarial noises to a raw image, a well-trained model could be successfully fooled into making a wrong decision that is totally opposite to the ground truth.83 The adversarial noise can be carefully designed to ensure the manipulated image is visually the same as the raw image such that it is almost impossible for humans to detect. Worse, this kind of adversarial attack can be conducted as a black box attack, meaning that no previous knowledge of the model details, such as information about the model’s structure or parameters, is required by an attacker.
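To make the idea of adversarial noise concrete, the sketch below applies a sign-of-gradient perturbation (in the style of the fast gradient sign method) to a toy logistic classifier. Everything here is an assumption for illustration: the 64-pixel "image", the weights, and the perturbation size of 0.03 are invented, the model is linear rather than a CNN, and the attack is white-box (it uses the model weights), unlike the black box attacks mentioned above.

```python
import math


def predict_prob(weights, bias, pixels):
    """Probability of 'disease' from a toy logistic model over pixel values."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    return (value > 0) - (value < 0)


def adversarial_image(weights, bias, pixels, true_label, epsilon):
    """Shift every pixel by at most epsilon in the direction that raises the loss."""
    prob = predict_prob(weights, bias, pixels)
    # For a logistic model, d(log loss)/d(pixel_i) = (prob - true_label) * w_i.
    gradient = [(prob - true_label) * w for w in weights]
    shifted = [p + epsilon * sign(g) for p, g in zip(pixels, gradient)]
    return [min(1.0, max(0.0, p)) for p in shifted]  # keep a valid intensity range


# Invented toy model and image: 64 pixels with intensities in [0, 1].
weights = [1.0, -1.0] * 32
bias = 0.0
image = [0.52, 0.48] * 32  # classified as disease

attacked = adversarial_image(weights, bias, image, true_label=1, epsilon=0.03)

prob_before = predict_prob(weights, bias, image)
prob_after = predict_prob(weights, bias, attacked)
largest_change = max(abs(a - b) for a, b in zip(attacked, image))
```

In this toy case no pixel changes by more than 0.03 on a 0 to 1 scale, yet the predicted probability of disease moves from above 0.5 to below it, which mirrors the behavior reported for much larger DL models.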
The vulnerability of DL models has spurred the community to seriously rethink the security and robustness of AI in real-world deployments, especially in the medical domain.84 Scaling up AI systems for clinical use without any defendable countermeasures means that any falsified diagnosis could lead to considerable risks.
Case Study: IDx-DR and Real-World Challenges of DLA
The limitations of retrospective in silico validation of DLA are significant. Real-world prospective trials are now increasingly essential for clinical uptake and regulatory approval of DLA systems. In a landmark study by Abràmoff et al,12 the efficacy of the “IDx-DR” system, a fully autonomous screening system for more-than-mild DR (mtmDR), was evaluated in a real-world prospective trial of 900 patients across 10 primary care sites in the United States.
This trial is notable as it both addresses and reflects many of the key challenges described in this article. The DLA was assessed in a prospective clinical trial, where inclusion criteria and the clinical setting were strictly defined. Eligible participants were asymptomatic patients, diagnosed with diabetes and not previously diagnosed with DR, in a primary care setting. All images were captured with 1 retinal camera model, the Topcon NW400 system. Captured images were evaluated by 2 IDx-DR image quality and diagnostic algorithms, which determined the presence of mtmDR. Results produced by the DLA were compared with high-quality ground-truth Wisconsin Fundus Photograph Reading Center widefield stereoscopic photography and OCT.
The system achieved a sensitivity and specificity of 87.2% and 90.7%, respectively, meeting endpoints predetermined by the US FDA. This allowed it to achieve the first regulatory approval ever for an autonomous AI-based diagnostic system.85 Following this precedent, prospective real-world validation trials are now essential for future regulatory approval of DLA systems. Notably, however, regulatory approval for the IDx-DR system remains limited to the cohort that was defined in the system’s validation trial. This includes the detection of mtmDR only in adults diagnosed with diabetes who have not been previously diagnosed with DR, and in nondilated images captured by the Topcon NW400 camera.86
The validation of DLA systems in less-controlled real-world settings remains immature. In the first human-centered observational study of a DLA in clinical care, published in April 2020, Google Health researchers working in 11 clinics across Thailand encountered a number of socioenvironmental factors that limited the accuracy and adoption of a DR screening DL system.87 Challenges included clinic screening conditions, image gradeability affecting system performance, internet speed and connectivity, and the impact of referrals on patient time. For instance, several clinics reported issues with image gradeability because fundus images were captured in nondarkened rooms, which resulted in insufficient pupil dilation and images of insufficient quality. Alternative darkened clinic rooms could not be found. Images were often rejected for grading by the DLA, requiring multiple attempts that added frustration and work to an already busy clinic. Furthermore, the DLA system required images to be uploaded to the cloud for assessment. The study sites often experienced slow and unreliable connections that slowed down the overall screening workflow and reduced the number of patients a clinic could screen daily. Lastly, a large number of patients were discouraged from participating in the study after understanding during the consent process that a positive screening result would require further assessment in a hospital an hour's drive away, and they opted out to avoid the possible additional time burden.
Thus, as described in this article, successful AI deployment in clinical practice requires the involvement of all stakeholders, including patients, ophthalmologists, imaging technicians, hospital administrations, regulatory bodies, and industry. End-users and their environment determine implementation, which may be as important as the accuracy of the algorithm itself. Early and material consideration of these real-world factors will be essential to the successful future clinical deployment of DLA systems.
There have been arguments made and concerns expressed that AI will replace professionals in future practice. However, one should note that currently, supervised ML is typically trained to discriminate features based on a trusted training set for only a limited assigned task, whereas humans are able to transfer experience and expertise to a new task through reasoning. A DLA system may be able to classify the presence and severity of a limited number of predefined diseases, to segment image structures, or even predict disease prognosis more accurately and perhaps more efficiently than humans. However, it is unable to make a valid diagnosis of diseases that it has not been trained for, nor able to accurately perform clinical reasoning based on multimodality data and experience, interact with patients properly, and perform treatment procedures like human doctors. To paraphrase the prominent AI expert Ng, the measure of a good AI technology is that it does well what humans can do, but easier and quicker, in 1 second.88 This will likely remain true until the development of the “singularity”: a hypothetical future point in time when “general AI” becomes available such that machines can learn, reason, and create like humans, undoubtedly and unforeseeably changing human civilization.