From categorizing brain tumors on MRIs to scanning for diabetic retinopathy to automatically transcribing physicians’ notes for electronic health records, artificial intelligence undoubtedly has the potential to transform the healthcare field. Harvard Medical School professor David Blumenthal is optimistic about coming developments, telling the HPR, “I don’t think we have fully conceptualized how much value there is in the vast pool of digitized information that is growing by galactic proportions, and it does form an underutilized natural resource. AI is going to enable us to make sense of it better than we could before.”
Nonetheless, leveraging AI’s potential carries its own set of risks. Perhaps even more so than in other areas, AI’s biases pose a significant challenge in the expansion and implementation of these technologies in the medical field.
AI can only be trained with existing data, often resulting in a lack of exposure to populations that have been historically undersampled, such as individuals with uncommon illnesses or members of racial minority groups. This can produce biases within AI algorithms. As Kasia Chmielinski, project lead at the national nonprofit the Data Nutrition Project, or DNP, observes in an interview with the HPR, “You lose a bunch of representation in that data. So the AI that you’re going to build is going to work really well on the populations it was trained on; it’s not going to work well for the other ones.”
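How this happens is easy to demonstrate. As a purely illustrative sketch, the short Python simulation below (using scikit-learn and synthetic data, not real medical records) trains a model on a dataset dominated by one group and then evaluates it separately on each group; the undersampled group, whose relationship between features and outcome differs, scores noticeably worse.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data: group A is heavily oversampled relative to group B,
# and the relationship between features and outcome differs by group.
def make_group(n, shift):
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + shift * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_a, y_a = make_group(5000, shift=0.2)   # well-represented group
X_b, y_b = make_group(200, shift=1.5)    # undersampled group

model = LogisticRegression().fit(np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))

# Evaluating on fresh samples from each group separately reveals the gap
# that a single aggregate accuracy figure would hide.
for name, shift in [("group A", 0.2), ("group B", 1.5)]:
    X_test, y_test = make_group(2000, shift)
    print(name, "accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))

The numbers here are arbitrary; the point is that evaluating performance per subgroup exposes disparities that an overall accuracy score conceals.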
Novel initiatives highlighting and combatting underlying dataset biases offer potential pathways to the advancement of AI technologies that, rather than exacerbate pervasive healthcare disparities, improve the quality of medical care for all.
Assessing Data Quality
The DNP was founded in 2018 as part of Harvard and MIT’s joint Assembly Fellowship with the mission of promoting the development of technologies that do not perpetuate entrenched stereotypes. Chmielinski, who was part of the original fellowship team, recalled that the DNP arose out of the realization that many issues identified in final AI products, after extensive costs and efforts had been invested into their development, could be linked back to limitations in the original datasets they were trained on. “If you have biased data, you have biased outcomes,” Chmielinski notes. “There’s no such thing as an unbiased dataset, but there are shockingly few tools for exploratory data analysis at the beginning of the [AI development] process.” To fill that void, the DNP designed an openly accessible platform to create “Dataset Nutrition Labels” so that those seeking to employ a dataset can better understand for which studies it would be most appropriate.
Given the variety of differently formatted datasets with wide-ranging use cases, setting universal standards or strictly quantitative metrics for the “health” of a dataset is difficult and possibly misguided. Furthermore, Chmielinski remarks that data scientists already routinely record and examine measures of dataset distribution and other computable features. The DNP’s labels therefore highlight properties of a dataset that are not apparent from simply looking at the data.
To identify these attributes, the DNP solicits a breadth of information from a dataset’s creators for inclusion in its Dataset Nutrition Label, such as their motivation, the intended purpose in constructing the dataset, and any recognized concerns or limitations with the data. Critically, they inquire about which populations were accounted for and the reasons for including those specific groups, what data was missing and how it was handled, and what criteria were used to remove any data.
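As a rough sketch of the kind of metadata such a label collects, the fields might be captured in a structured record like the Python mock-up below. The schema and values here are hypothetical and illustrative only, not the DNP’s actual label format.

from dataclasses import dataclass
from typing import List

# Hypothetical sketch of fields a Dataset Nutrition Label might capture;
# this is illustrative, not the DNP's actual schema.
@dataclass
class DatasetNutritionLabel:
    name: str
    motivation: str                      # why the dataset was created
    intended_use: str                    # studies or applications it suits
    populations_represented: List[str]   # which groups were accounted for, and why
    known_limitations: List[str]         # recognized concerns with the data
    missing_data_handling: str           # what was missing and how it was handled
    exclusion_criteria: str              # criteria used to remove any data

label = DatasetNutritionLabel(
    name="example-dermatology-images",
    motivation="Support research on photo-based skin lesion classification",
    intended_use="Benchmarking lesion classifiers; not for clinical deployment",
    populations_represented=["Adults imaged at three U.S. academic centers"],
    known_limitations=["Underrepresentation of darker skin tones"],
    missing_data_handling="Images without biopsy confirmation were excluded",
    exclusion_criteria="Records lacking documented patient consent were removed",
)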
Ultimately, these characteristics and descriptions help potential users of the dataset evaluate whether it is suitable for their applications, account for its limitations, and seek other datasets as needed for their specific purpose. Chmielinski remarks, “We’ve had a way broader impact in terms of the acknowledgment and recognition that we need to have some ways of talking about data quality that explain where it came from, who’s represented in the dataset, and how it should and shouldn’t be used.”
For example, the DNP has been collaborating with researchers at Memorial Sloan Kettering Cancer Center on a grant from the Burroughs Wellcome Fund. They are investigating how the FDA, which is responsible for approving AI software because such software is currently classified as a medical device, could regulate dermatological datasets used to develop apps such as those with photo-based skin cancer diagnosis features.
Chmielinski seems cautiously optimistic about these apps: “There’s a real lack of dermatologists in certain areas of the world, so this kind of app could give people access to medical care who otherwise can’t get it,” they explain. However, they note that the accuracy of these apps may be compromised on certain populations as “these underlying training datasets don’t have a good representation of all skin colors and types; it’s definitely biased towards lighter skin colors and Western contexts.” Hence, for the DNP, one crucial element in this partnership is investigating the datasets used and contemplating how broader requirements around the representation of various groups in AI training data could be established.
Healthcare Data Regulation: Considerations and Challenges
Given that many existing data sources fail to sufficiently include diverse populations, researchers, medical professionals, and AI developers must work to generate equitable datasets.
At the U.S. federal level, regulations specific to data collection for healthcare AI training are relatively weak, with limited oversight beyond the 1996 Health Insurance Portability and Accountability Act. Known as HIPAA, this act specifies that certain “covered entities,” including healthcare providers and their business associates, may not process or share patient data for purposes outside of the entity. Yet covered entities can be exempted from this regulation, allowing data to be disclosed to or purchased by a software developer and used to train AI models, by deidentifying it: removing all 18 elements of Protected Health Information, or PHI, named in HIPAA, which include IP addresses, Social Security numbers, and fingerprints, among other key identifiers.
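In practice, deidentification means systematically stripping those identifiers from records before the data changes hands. The toy Python sketch below scrubs only a handful of the 18 PHI categories (Social Security numbers, IP addresses, emails, phone numbers) from free text using simple patterns; it is purely illustrative, and a HIPAA-grade pipeline would have to handle all 18 categories, including names, dates, and biometric identifiers.

import re

# Toy rule-based deidentification covering only a few of the 18 HIPAA PHI
# identifier types; a production pipeline must handle all 18.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace matched identifiers with bracketed placeholders."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label} REMOVED]", text)
    return text

note = "Patient reachable at 555-867-5309 or jdoe@example.com, SSN 123-45-6789."
print(scrub(note))
# Patient reachable at [PHONE REMOVED] or [EMAIL REMOVED], SSN [SSN REMOVED].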
One further consideration for those collecting patient health data is the 1991 Federal Policy for the Protection of Human Subjects. Known as the Common Rule, this policy establishes how informed consent, where participants voluntarily opt in to a procedure after reviewing its advantages and risks with a healthcare provider, must be obtained when conducting research involving identifiable human subject data. Nevertheless, both HIPAA and the Common Rule cease to apply once data has been deidentified, a term that, according to University of Michigan professor Nicholson Price in an interview with the HPR, becomes increasingly challenging to conceptualize with the advent of AI-based reidentification schemes.
However, the process of creating new datasets, especially large, deidentified ones or ones involving complicated consent forms, often requires extensive time and resources that are not readily available in many healthcare systems. These efforts may further exacerbate biases in AI training data because, currently, only a select group of well-funded hospitals are able to dedicate personnel to correctly recording, structuring, and deidentifying patient data.
Across 56 research studies published on PubMed between 2015 and 2019, the training data collected to develop each study’s AI algorithm came predominantly from three states: California, Massachusetts, and New York, with over 70% of studies including data from at least one of these states to train their model. Thirty-four states, including some of the most populous such as Florida, Illinois, and Georgia, were entirely absent from all 56 studies. Patient cohorts in the three states may not accurately represent the national population, limiting the generalizability of the models and their accuracy when applied to wider populations.
Expanding Datasets for AI Use
To combat these barriers, several recent initiatives have launched large-scale efforts to create representative, deidentified, quality datasets. The National Institutes of Health’s All of Us Research Program, for instance, had released genomic sequence data from more than 245,000 volunteers as of February 2024. More than three-quarters of volunteers are from groups that have traditionally been underrepresented in medical research data, and the study aims to engage over one million individuals in total. Analysis of this dataset has uncovered over a quarter billion previously unreported genetic variants. The Million Veteran Program has undertaken a similar endeavor to understand how both genetic and environmental factors contribute to human health outcomes.
Price commends these plans, telling the HPR, “We want a diverse set of high-quality data. We recognize that that’s not easy for places to do on their own and that there is bias in who tends to collect and make available the data.” Addressing these goals has required financing collaborative, centralized initiatives to gather reliable and accurate data.
Furthermore, AI systems themselves could be leveraged to facilitate the creation of quality datasets. For example, they could automate the data deidentification process or summarize consent forms to increase comprehensibility for patients willing to participate in human subjects research studies.
Applying AI systems that could autonomously and consistently remove protected identifiers from datasets could significantly reduce the costs of creating datasets and open the doors for more inclusive data to be deidentified and used for training AI models. For example, the company Integral has been focusing on providing such autonomous systems that replace expensive manual deidentification practices, specifically in a healthcare context and with regard for HIPAA and PHI guidelines. However, as Price suggests, from a legal perspective, forming statutes around liability and accountability in cases where an AI deidentification system’s efforts yield a privacy breach could be complex and may require extensive discussion.
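Setting the liability question aside, one way such automation might work in principle, sketched here purely as an illustration and not as a description of Integral’s product, is to run clinical text through a named-entity recognition model and redact the spans it flags. The example below assumes spaCy and its general-purpose small English model; a real system would use a model trained and validated on clinical text against every PHI category.

import spacy

# Requires: python -m spacy download en_core_web_sm
# A general-purpose NER model stands in for a clinically trained one.
nlp = spacy.load("en_core_web_sm")

# Entity labels treated as potential PHI in this toy example.
PHI_LABELS = {"PERSON", "DATE", "GPE", "ORG"}

def redact(text: str) -> str:
    """Replace spans the NER model tags as potential PHI with placeholders."""
    doc = nlp(text)
    redacted = text
    # Replace from the end of the string so character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in PHI_LABELS:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

print(redact("John Smith was admitted to Mercy Hospital in Boston on March 3, 2021."))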
AI tools could also support the expansion of broader human subjects research datasets in accordance with the Common Rule but without deidentification. In a 2024 study at Brown Medical School, GPT-4 was employed with some human oversight to enhance the readability of consent forms so that participants could truly understand the terms to which they were agreeing. The complex language of the forms, which scored a mean readability of a first-year college level, was simplified by more than five grade levels while maintaining accuracy, as evaluated by both physicians and malpractice attorneys. Although this study focused primarily on surgical consent forms, a similar technique could be replicated to clarify other types of medical texts and consent forms to foster patient comfort in study participation.
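A minimal sketch of what such an LLM-assisted simplification step could look like appears below, assuming the OpenAI Python SDK and the textstat readability library; the prompt and workflow are hypothetical rather than the Brown study’s actual pipeline, and any output would still require review by clinicians and attorneys.

import textstat                 # readability metrics such as Flesch-Kincaid grade
from openai import OpenAI

client = OpenAI()               # assumes OPENAI_API_KEY is set in the environment

CONSENT_PARAGRAPH = (
    "The investigational procedure may be associated with adverse events "
    "including, but not limited to, postoperative hemorrhage and infection."
)

# Ask the model to rewrite the paragraph at roughly an 8th-grade reading level.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Rewrite consent-form text at an 8th-grade "
                                      "reading level without changing its meaning."},
        {"role": "user", "content": CONSENT_PARAGRAPH},
    ],
)
simplified = response.choices[0].message.content

# Compare readability before and after; human review remains essential.
print("Original grade level:  ", textstat.flesch_kincaid_grade(CONSENT_PARAGRAPH))
print("Simplified grade level:", textstat.flesch_kincaid_grade(simplified))
print(simplified)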
In this regard, Blumenthal emphasizes the clear distinction between the concept of “informed consent,” required under the Common Rule for sensitive health data collected on individuals, and “meaningful consent.” “Meaningful consent could be enhanced by AI itself because professionals are not very good at writing for lay people,” he suggests, reiterating that “the goal should be [obtaining] meaningful consent with the understanding that there are areas where consumers cannot meaningfully consent and where regulation may be needed to supersede what they can do.”
AI and Data Accessibility
While AI applications show promise in advancing medical decision-making and care, the dearth of diversity in training data can conversely propagate biases and aggravate healthcare disparities among patient populations. Finding effective pathways to the widespread use of representative and inclusive datasets may be challenging. Yet as a general framework, Chmielinski believes that “there should be a requirement to understand what data is used to train systems, and there should be some access to the underlying data.” They suggest providing access to auditors and professionals in the field (though not broad, open access, given potential confidential business information) to investigate and improve our understanding of the data and mechanisms involved in building AI systems.
Nonetheless, as Price concludes, “The vision of the future world I would like is one where the vast majority of people are really comfortable sharing their data, recognizing that bad uses of it are going to be prohibited and that good uses of it will improve healthcare for them and for everybody else. We’re a long way from that.” Yet undertakings such as the DNP and the All of Us project present opportunities and a framework for movement in that direction, providing hope that one day, such an environment could become reality.