The Medical Second Opinion Problem In both machine learning and medicine, disagreement amongst human labellers poses a significant challenge. In healthcare, this label disagreement becomes a full-fledged clinical problem: the individual ‘labellers’ are now highly trained human experts, doctors, and the labels are diagnoses or recommended clinical courses of action. A patient case is typically seen by a single expert, yet some cases naturally give rise to significant variation in expert diagnoses, despite this domain expertise. We call the clinical problem of identifying patient cases with high disagreement the medical second opinion problem.

Machine Learning and Medical Second Opinions This expert disagreement is often not just due to random mistakes, but to specific features in patient cases that give rise to bias and misinterpretation in human judgement. This suggests applying machine learning to predict which cases are likely to generate the greatest disagreement. These cases can then automatically be given additional attention.

Retinal Fundus Images and Diabetic Retinopathy (DR) Our main task looks at retinal fundus images, large scans of the retina that can be used to diagnose a variety of eye diseases; we concentrate on Diabetic Retinopathy (DR). DR is graded on a five-class scale, None, Mild, Moderate, Severe, Proliferative, corresponding to grades 1 to 5. An important clinical threshold sits at grade 3 and above, Referable DR, which requires immediate specialist attention. Doctors are therefore especially careful not to diagnose a referable patient as non-referable.
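To make the referability threshold concrete, here is a minimal sketch in Python (the grade names follow the scale above; the helper name is ours, not from the paper):

```python
# Five-class DR scale, grades 1 to 5, as described above.
DR_GRADES = {1: "None", 2: "Mild", 3: "Moderate", 4: "Severe", 5: "Proliferative"}

def is_referable(grade: int) -> bool:
    """Grade 3 (Moderate) and above is Referable DR, requiring
    immediate specialist attention."""
    return grade >= 3
```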

Direct Uncertainty Prediction We develop convolutional neural networks that take in these retinal images and predict, as a single scalar score between 0 and 1, how much disagreement an image will generate. We call these models direct uncertainty predictors. They differ from the classification-based approach to predicting uncertainty, which is a two-step process of (i) training a classifier and (ii) using the classifier's outputs to predict uncertainty. We find that direct uncertainty predictors outperform all classification baselines across the multiple tasks considered.
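As a rough illustration of the idea (a sketch under our own assumptions, not the paper's exact architecture, input size, or training setup), a direct uncertainty predictor can be a standard convolutional backbone with a single sigmoid output, trained end-to-end against disagreement targets:

```python
import tensorflow as tf

def build_direct_uncertainty_predictor(input_shape=(299, 299, 3)):
    # Any standard convolutional backbone works here; InceptionV3 is a
    # convenient off-the-shelf choice, not necessarily the paper's model.
    backbone = tf.keras.applications.InceptionV3(
        include_top=False, weights=None, input_shape=input_shape, pooling="avg"
    )
    # Single sigmoid unit: the image maps directly to a disagreement
    # score in [0, 1], with no intermediate disease classification step.
    score = tf.keras.layers.Dense(1, activation="sigmoid")(backbone.output)
    model = tf.keras.Model(backbone.input, score)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model
```

By contrast, a classification baseline would first train a five-class DR classifier and only then derive an uncertainty score from its outputs (e.g. the entropy of the predicted grade distribution).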

Adjudicated Evaluation Our train and holdout data consist of a relatively large dataset T in which each image typically has one to three labels (individual doctor grades), often with high levels of disagreement between the doctors. To ensure an accurate evaluation of our models, we turn to a small adjudicated dataset, where each image has many individual labels from specialists as well as a single adjudicated grade, decided upon through discussion between a set of experts.
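One simple way to turn such multi-grader labels into a binary training target for a direct uncertainty predictor (a sketch; the paper studies several disagreement-based targets) is to mark an image as uncertain whenever its doctors disagree:

```python
def disagreement_target(grades: list[int]) -> int:
    """0/1 target for one image: 1 if the individual doctor grades
    (on the 1-5 DR scale) are not all identical, else 0."""
    return int(len(set(grades)) > 1)

# Examples: unanimous graders give 0, any split gives 1.
assert disagreement_target([2, 2]) == 0
assert disagreement_target([2, 3, 2]) == 1
```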

The adjudicated dataset has significantly different statistics from the train and holdout dataset T (e.g. much higher levels of agreement), but it enables us to test whether the models can accurately identify cases where an individual doctor is most likely to disagree with the unknown ground-truth condition (the adjudicated grade).
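Concretely, one natural way to score this (an illustrative metric, not necessarily the paper's exact protocol) is to check how well the model's uncertainty score ranks images by whether any individual doctor grade differs from the adjudicated grade:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def adjudicated_disagreement_auc(pred_scores, individual_grades, adjudicated_grades):
    """AUC of the model's uncertainty scores for predicting, per image,
    whether at least one individual doctor grade differs from the
    adjudicated grade.

    pred_scores:        one model uncertainty score per image
    individual_grades:  list of per-doctor grade lists, one per image
    adjudicated_grades: the single adjudicated grade per image
    """
    disagrees = np.array([
        int(any(g != adj for g in grades))
        for grades, adj in zip(individual_grades, adjudicated_grades)
    ])
    return roc_auc_score(disagrees, np.asarray(pred_scores))
```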

For this and other evaluations (e.g. ranking images by maximal uncertainty), the models perform strongly, particularly the direct uncertainty predictors, despite the differences between the training and adjudicated evaluation data. See the paper (https://arxiv.org/abs/1807.01771) for more details! We’re excited to see what future work can be done on ML for medical second opinions, along with other use cases of direct uncertainty prediction!