Updated at 7:44 a.m. Pacific: This headline and copy have been updated to correct a misinterpretation of the paper’s findings.
Bias in AI is pervasive. From dermatological models that discriminate against patients with darker skin to exam-scoring algorithms that disadvantage public school students, you don’t need to look far for examples of encoded prejudice. But how do these biases arise in the first place? Researchers at Columbia University sought to shed light on the problem by tasking 400 AI engineers with creating algorithms that made over 8.2 million predictions about 20,000 people. In a study accepted to the Navigating the Broader Impacts of AI Research workshop at the 2020 NeurIPS machine learning conference, the researchers conclude that biased predictions are mostly caused by imbalanced data but that the demographics of engineers also play a role.
“Across a wide variety of theoretical models of behavior, biased predictions are responsible for demographic segregation and outcome disparities in settings including labor markets, criminal justice, and advertising,” the researchers wrote. “Research and public discourse on this topic have grown enormously in the past five years, along with a growth in programs to introduce ethics into technical training. However, few studies have attempted to evaluate, audit, or learn from these interventions or connect them back to theory.”
The researchers recruited 80% of the engineers they evaluated through a boot camp that taught AI techniques at a computer science graduate or advanced undergraduate degree level. The remaining 20% were freelance machine learning contractors who had worked in the industry an average of about four years. All 400 were given the same assignment: Develop an algorithm to predict math performance from job applications and apply it to 20,000 people who don’t appear in a training dataset.
For the purposes of the study, the engineers were divided into groups in which certain engineers were given data featuring “realistic” (i.e., biased) sample selection problems while others received data featuring no sample selection problems. A third group was provided the same training data as the first group, in addition to a non-technical reminder about the possibility of algorithmic bias. And a fourth was given this data and reminder, as well as a simple whitepaper about sample selection correction methods in machine learning.
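The paper doesn’t publish its exact data-generating process, but the effect of a sample selection problem can be illustrated with a minimal Python sketch using entirely made-up numbers: when one group is under-represented in the training sample, even a simple predictor systematically mispredicts that group at evaluation time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 10,000 people in two groups, where
# group B (1) actually performs better on average than group A (0).
n = 10_000
group = rng.integers(0, 2, n)
score = 70 + 5 * group + rng.normal(0, 10, n)

# "Realistic" sample selection problem: group B members make it into
# the training data with only 10% probability.
keep = (group == 0) | (rng.random(n) < 0.10)
train_score = score[keep]

# A naive model: predict the training-set mean for everyone.
prediction = train_score.mean()

# Mean prediction error on the full population, by group.
err_a = prediction - score[group == 0].mean()
err_b = prediction - score[group == 1].mean()

print(f"mean error, group A: {err_a:+.2f}")
print(f"mean error, group B: {err_b:+.2f}")  # systematically under-predicted
```

Because group B is mostly missing from training, the model’s estimate is dragged toward group A’s average, producing a large systematic error for group B and almost none for group A.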
Unsurprisingly, the researchers found that the algorithms developed by engineers with better training data exhibited less bias. Moreover, this subset of engineers spent more hours working on their algorithms, suggesting that the marginal benefit of additional development time was higher with higher-quality data.
But training data, or the lack thereof, wasn’t the only source of bias, according to the researchers. As alluded to earlier, they also found that prediction errors were correlated within demographic groups, particularly by gender and ethnicity: two male programmers’ prediction errors, for example, were more likely to be correlated with each other than errors from programmers of different genders. While individual engineers were more or less equally biased across race, gender, and ethnicity, engineers from the same demographic group were more likely to make the same prediction errors. This indicates that the more homogeneous a team is, the more likely it is that a given prediction error will appear twice. The takeaway is that more diverse teams reduce the chance of compounding biases.
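This kind of within-group error correlation is straightforward to measure. The sketch below, with invented numbers rather than the study’s data, shows why it matters: if two engineers share a systematic error component, the correlation between their prediction errors is high, while an engineer whose errors are independent shows near-zero correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true scores for 5,000 people.
true_scores = rng.normal(70, 10, 5_000)

# Engineers 1 and 2 share a systematic error component (a "shared bias");
# engineer 3's errors are entirely independent.
shared_bias = rng.normal(0, 5, 5_000)
pred_eng1 = true_scores + shared_bias + rng.normal(0, 3, 5_000)
pred_eng2 = true_scores + shared_bias + rng.normal(0, 3, 5_000)
pred_eng3 = true_scores + rng.normal(0, 5, 5_000)

err1, err2, err3 = (p - true_scores for p in (pred_eng1, pred_eng2, pred_eng3))

corr_same = np.corrcoef(err1, err2)[0, 1]  # high: shared systematic error
corr_diff = np.corrcoef(err1, err3)[0, 1]  # near zero: independent errors

print(f"error correlation, shared bias:      {corr_same:.2f}")
print(f"error correlation, independent bias: {corr_diff:.2f}")
```

A team whose members all share the same error component duplicates its mistakes, which is exactly the compounding effect the researchers attribute to homogeneous teams.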
Among factors that might contribute to the gender disparity, studies have shown that women in computer science are socialized to feel they must achieve perfection. A survey conducted by Vancouver-based Supporting Women in Information Technology found that women who pursue a computer science degree say they’re less confident than their male counterparts when using a computer. More recent work published by Gallup and Google reveals that American girls in grades 7-12 express less confidence than boys in their ability to learn computer science skills.
Bias could be mitigated somewhat by the reminders, the researchers say, but results for the technical guidance were mixed. Programmers who understood the whitepaper successfully reduced bias, but most didn’t follow its advice, resulting in algorithms that performed worse than those of programmers who received only the reminders.
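The article doesn’t reproduce the whitepaper, but one standard family of sample selection corrections is inverse-probability weighting: each retained training example is up-weighted by the reciprocal of its selection probability, undoing the skew introduced by selective sampling. A toy sketch with invented selection probabilities:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: group B (1) scores higher but is sampled
# into the training data with only 10% probability.
n = 10_000
group = rng.integers(0, 2, n)
score = 70 + 5 * group + rng.normal(0, 10, n)
keep = (group == 0) | (rng.random(n) < 0.10)
g, s = group[keep], score[keep]

# Inverse-probability weights: 1 / P(selected | group). Group B examples
# each count 10x, restoring the population's group balance.
p_select = np.where(g == 1, 0.10, 1.0)
weights = 1.0 / p_select

naive_mean = s.mean()                            # skewed toward group A
corrected_mean = np.average(s, weights=weights)  # close to the true mean

print(f"naive estimate:    {naive_mean:.2f}")
print(f"weighted estimate: {corrected_mean:.2f}")
print(f"population mean:   {score.mean():.2f}")
```

The same weighting idea carries over to model training (most libraries accept per-example sample weights), which is why a correction like this only helps engineers who actually apply it, consistent with the study’s finding that understanding the guidance, not merely receiving it, reduced bias.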
The researchers caution their paper isn’t the final word on sources of algorithmic bias and that their subject pool, while slightly more diverse than the U.S. software engineering population, contained mostly male (71%), East Asian (52%), and white (28%) engineers. Engineers recruited from the boot camp were also less experienced than the freelancers, and only 31% had been employed by “a household-name company” at the time of the study’s publication.
However, the coauthors believe their work could serve as an important stepping stone toward identifying and addressing the causes of AI bias in the wild. “Questions about algorithmic bias are often framed as theoretical computer science problems. However, productionized algorithms are developed by humans, working inside organizations, who are subject to training, persuasion, culture, incentives, and implementation frictions,” they wrote. “An empirical, field experimental approach is also useful for evaluating practical policy solutions.”
Numerous studies have demonstrated the prevalence of bias in AI. Facial recognition models fail to recognize Black, Middle Eastern, and Latinx people more often than those with lighter skin. AI researchers from MIT, Intel, and Canadian AI initiative CIFAR have found high levels of bias from some of the most popular pretrained models. And algorithms developed by Facebook have proven to be 50% more likely to disable the accounts of Black users compared with white users.
This paper adds to the growing chorus of voices calling for renewed efforts to eliminate bias along racial and gender lines.