Development and Validation of an Automated Diabetic Retinopathy Screening Tool for Primary Care Setting

We report the construction of fully automated deep learning (DL) software, hosted on a secure, Health Insurance Portability and Accountability Act–compliant cloud platform, for effective and efficient screening of referable and nonreferable diabetic retinopathy (DR) in the primary care setting. DR is one of the leading causes of blindness in the U.S. and other developed countries, and early detection is key to prevention. Currently, screening for DR is performed by ophthalmologists or optometrists, who have limited catchment areas, and it requires a time-consuming referral. Many patients (some studies suggest almost one in three) do not receive these exams (1).

The critical goal is to identify “referable” DR, that is, DR that warrants referral to an ophthalmologist. A 5-point scale (2) (no DR and mild, moderate, severe, and proliferative DR) may be used for grading DR based on the presence and extent of microaneurysms, exudates, hemorrhages, and other abnormalities. By definition, no DR and mild DR are considered nonreferable, and the other categories are considered referable. Earlier automated DR screening techniques varied in accuracy, but DL approaches have recently had greater success.
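The referable/nonreferable rule above can be written as a simple threshold on the 5-point scale. The following is a minimal sketch; the names `DRGrade` and `is_referable` and the numeric codes 0 to 4 are illustrative assumptions, not part of the system described here.

```python
from enum import IntEnum


class DRGrade(IntEnum):
    """Five-point DR severity scale; the numeric codes are an assumption."""
    NO_DR = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3
    PROLIFERATIVE = 4


def is_referable(grade: DRGrade) -> bool:
    """Moderate DR or worse is referable; no DR and mild DR are not."""
    return grade >= DRGrade.MODERATE


if __name__ == "__main__":
    # Print the referral status for every grade on the scale.
    for grade in DRGrade:
        label = "referable" if is_referable(grade) else "nonreferable"
        print(f"{grade.name}: {label}")
```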

DL refers to a class of machine learning techniques that learn features directly from images without explicit feature labels and usually require very large training data sets, such as the publicly available Kaggle data set (KG-set) (88,702 subjects) (3). DL has also been applied to detecting diseases such as macular degeneration and melanoma.

Our DL screening system was initially built on 70% of the KG-set and validated prospectively. Complete details will be presented in a forthcoming publication. Conceptually, a model for retinal fundus image classification must learn features robustly across a wide range of scales, from small microaneurysms to large fronds of neovascularization. Therefore, three DL neural networks were used (Xception, Inception-V3, and Inception-ResNet-V2), each operating at one or two image resolutions, to give five networks in total and increase robustness to image features of different sizes. We have also used this ensemble approach in age-related macular degeneration (4). For a given image, each of the five networks outputs five probabilities, one for each DR class: no DR and mild, moderate, severe, and proliferative DR. The 25 total probabilities are then input to a logistic model tree (LMT), which combines logistic regression and decision tree learning. The LMT is trained to decide the DR class based on the totality of the DL inputs and is the final classifier.
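The data flow of the ensemble (five networks, five probabilities each, 25 inputs to a final classifier) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names and example probabilities are hypothetical, and the final step uses a plain average-and-argmax vote as a stand-in, since a trained LMT is not reproduced here.

```python
from typing import List, Sequence

CLASSES = ["no DR", "mild", "moderate", "severe", "proliferative"]

# Hypothetical outputs of the five networks for one image (each row sums to 1).
EXAMPLE_OUTPUTS = [
    [0.10, 0.20, 0.60, 0.05, 0.05],
    [0.05, 0.25, 0.55, 0.10, 0.05],
    [0.20, 0.30, 0.40, 0.05, 0.05],
    [0.10, 0.15, 0.65, 0.05, 0.05],
    [0.15, 0.35, 0.40, 0.05, 0.05],
]


def stack_features(per_network: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate each network's class probabilities into one feature vector."""
    features: List[float] = []
    for probs in per_network:
        if len(probs) != len(CLASSES):
            raise ValueError("each network must emit one probability per class")
        features.extend(probs)
    return features


def average_vote(per_network: Sequence[Sequence[float]]) -> str:
    """Stand-in final classifier: mean probability per class, then argmax.

    This is NOT a logistic model tree; it only marks the place where a
    trained meta-classifier would consume the stacked features.
    """
    n = len(per_network)
    means = [sum(p[i] for p in per_network) / n for i in range(len(CLASSES))]
    return CLASSES[means.index(max(means))]


if __name__ == "__main__":
    print(len(stack_features(EXAMPLE_OUTPUTS)))  # length of the stacked vector
    print(average_vote(EXAMPLE_OUTPUTS))
```

A trained meta-classifier such as the LMT can weight networks and classes unequally, which a simple average cannot do.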

We then validated our system on the remaining 30% of the KG-set and the complete publicly available MESSIDOR data set (3) of 1,748 high-resolution fundus images. We achieved high sensitivities and specificities (98–99% in each category), equivalent to those achieved by human experts. Notably, we compared our model on MESSIDOR with a well-known DL algorithm developed by IDx (5), which achieved a sensitivity of 97% and a specificity of 87% for referral-level DR; our algorithm’s specificity was 99%. Indeed, to the best of our knowledge, our system performs at least comparably with other existing screening systems on these data sets.

To test real-world performance, we then examined nonmydriatic retinal images acquired prospectively from 974 patients with diabetes in a primary care setting at the New York Eye and Ear Infirmary of Mount Sinai. Images were taken on a Topcon TRC-NW400 camera between 1 January 2017 and 31 December 2017, with approval by the Institutional Review Board of Mount Sinai. Two retina specialists (S.S. and M.G.) judged 814 patients to have no DR, 83 to have mild DR, 12 to have moderate DR, 2 to have severe DR, and 5 to have proliferative DR. These gold standard gradings were compared with the output of our DL system, which achieved a sensitivity of 82.6%, adequate for U.S. Food and Drug Administration approval of a screening system, and a specificity of 93.7% for referral-level DR.
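For reference, sensitivity and specificity for referral-level DR follow from comparing the system's binary output with the specialists' gradings. The sketch below assumes each patient is reduced to a single referable/nonreferable label; the function name and the toy labels are hypothetical and are not the study data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    reference: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary referable-DR labels.

    True means referable DR; `reference` holds the gold standard gradings.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have equal length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # referable, flagged
    fn = sum(1 for r, p in pairs if r and not p)      # referable, missed
    tn = sum(1 for r, p in pairs if not r and not p)  # nonreferable, cleared
    fp = sum(1 for r, p in pairs if not r and p)      # nonreferable, flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one referable and one nonreferable case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for ten hypothetical patients (not the study data).
TOY_REFERENCE = [True, True, True, True, False, False, False, False, False, False]
TOY_PREDICTED = [True, True, True, False, False, False, False, False, False, True]

if __name__ == "__main__":
    sens, spec = sensitivity_specificity(TOY_REFERENCE, TOY_PREDICTED)
    print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Because sensitivity depends only on the referable cases, a cohort with few referable patients yields a coarse estimate: each missed case shifts the figure by several percentage points.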

A telemedicine platform was also created to integrate the server-side programs (DL modules) with the local remote devices that collect patient data and images. A report of the automated analysis of the patient’s DR status (referable or nonreferable) can be sent to the primary care provider within a minute.

The validation of our DR screening tool on the KG-set and MESSIDOR data set, together with its high accuracy on the prospectively acquired Mount Sinai data set, suggests that it is suitable for the primary care setting. The system should be tested prospectively on a large scale in clinical settings on the integrated telemedicine platform. If the excellent clinical performance of the tool thus far is confirmed, deployment in primary care settings for early diagnosis of DR will be warranted.