Addressing Bias in Physician-Machine Partnerships
This research program examines how AI assistance can enhance physician decision-making in diagnosing inflammatory-appearing skin disease and improve healthcare outcomes for patients.
In our most recent paper, we explore the question: When can AI assistance help physicians diagnose skin disease, and when may it mislead them? To begin addressing this question, it is important to first benchmark physician diagnostic accuracy without AI assistance, so that we can identify the benefits and drawbacks of AI assistance. Given the well-documented disparities in healthcare across patients' race, we intentionally curated a dataset of images spanning diverse skin tones to examine how physician-machine partnerships perform across light and dark skin.
Previously, we published two papers exploring algorithmic bias and the inter-rater reliability of Fitzpatrick Skin Types in clinical images of skin disease: the Fitzpatrick 17k paper, published at the ISIC workshop at CVPR 2021, and the Towards Transparency paper, published at CSCW 2022.
This research program includes contributions from MIT researchers Matt Groh, Caleb Harris, Luis Soenksen, P. Murali Doraiswamy, and Rosalind Picard, and board-certified dermatologists Omar Badri, Roxana Daneshjou, and Arash Koochek.
Deep learning-aided decision support for diagnosis of skin disease across skin tones (2024)
Abstract
Although advances in deep learning systems (DLS) for image-based medical diagnosis demonstrate their potential to augment clinical decision-making, the effectiveness of physician-machine partnerships remains an open question in part because physicians and algorithms are both susceptible to systematic errors, especially for diagnosis of underrepresented populations. Here we present results from a large-scale digital experiment involving board-certified dermatologists (n=389) and primary care physicians (n=459) from 39 countries to evaluate the accuracy of diagnoses submitted by physicians in a store-and-forward tele-dermatology simulation. In this experiment, physicians were presented with 364 images spanning 46 skin diseases and asked to submit up to 4 differential diagnoses. Specialists and generalists achieved a diagnostic accuracy of 38% and 19%, respectively, but both specialists and generalists were 4 percentage points less accurate for diagnosis of images of dark skin as compared to light skin. Fair DLS decision support improved the diagnostic accuracy of both specialists and generalists by more than 33% but exacerbated the gap in the diagnostic accuracy of generalists across skin tone. These results demonstrate that well-designed physician-machine partnerships can enhance the diagnostic accuracy of physicians, illustrating that success in improving overall diagnostic accuracy does not necessarily address bias.
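The abstract reports one result in percentage points (the 4-point gap across skin tones) and another as a percentage (the improvement of more than 33%). The two units are easy to confuse, so the following is a minimal Python sketch of the arithmetic. It assumes the 33% figure is a relative improvement over the unassisted baseline, which the abstract does not state explicitly. The function names are illustrative and do not come from the paper, and the computed values are not reported results.

```python
def apply_relative_improvement(baseline: float, relative_gain: float) -> float:
    """Accuracy after a relative gain, e.g. 0.33 means 33% above the baseline."""
    return baseline * (1.0 + relative_gain)


def percentage_point_change(baseline: float, new: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return (new - baseline) * 100.0


def relative_change(baseline: float, new: float) -> float:
    """Difference between two accuracies as a fraction of the baseline."""
    return (new - baseline) / baseline


# Unassisted accuracies reported in the abstract.
baseline_accuracy = {"specialists": 0.38, "generalists": 0.19}

# ASSUMPTION: "more than 33%" is read as a relative gain; 0.33 is used as a lower bound.
assumed_relative_gain = 0.33

for group, accuracy in baseline_accuracy.items():
    assisted = apply_relative_improvement(accuracy, assumed_relative_gain)
    points = percentage_point_change(accuracy, assisted)
    print(f"{group}: {accuracy:.1%} -> {assisted:.1%} (+{points:.1f} percentage points)")
```

Under this reading, the same relative gain corresponds to a different number of percentage points for each group, because the baselines differ. This is why a relative improvement and a percentage-point gap, such as the 4-point difference across skin tones, cannot be compared directly.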