People mirror AI systems’ hiring biases, study finds

In a new UW study, 528 people worked with simulated LLMs to pick candidates for 16 different jobs, from computer systems analyst to nurse practitioner to housekeeper. The researchers simulated different levels of racial bias in LLM recommendations for resumes from equally qualified white, Black, Hispanic and Asian men. Photo: Delmaine Donson/iStock

An organization drafts a job listing with artificial intelligence. Droves of applicants polish their resumes and cover letters with chatbots. Another AI system sifts through those applications, passing recommendations to hiring managers. Perhaps AI avatars conduct screening interviews. This is increasingly the state of hiring, as people seek to streamline the stressful, tedious process with AI.

Yet research is finding that hiring bias — against people with disabilities, or certain races and genders — permeates large language models, or LLMs, such as ChatGPT and Gemini. We know less, though, about how biased LLM recommendations influence the people making hiring decisions.

In a new University of Washington study, 528 people worked with simulated LLMs to pick candidates for 16 different jobs, from computer systems analyst to nurse practitioner to housekeeper. The researchers simulated different levels of racial bias in LLM recommendations for resumes from equally qualified white, Black, Hispanic and Asian men.

When picking candidates without AI or with neutral AI, participants picked white and non-white applicants at equal rates. But when they worked with a moderately biased AI, if the AI preferred non-white candidates, participants did too. If it preferred white candidates, participants did too. In cases of severe bias, people made only slightly less biased decisions than the recommendations.

The team presented its findings Oct. 22 at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society in Madrid.

“In one survey, 80% of organizations using AI hiring tools said they don’t reject applicants without human review,” said lead author Kyra Wilson, a UW doctoral student in the Information School. “So this human-AI interaction is the dominant model right now. Our goal was to take a critical look at this model and see how human reviewers’ decisions are being affected. Our findings were stark: Unless bias is obvious, people were perfectly willing to accept the AI’s biases.”

Participants were given a job description and the names and resumes of five candidates: two white men; two men who were either Asian, Black or Hispanic; and one candidate whose resume lacked qualifications for the job, to obscure the purpose of the study. An example from the study is shown here. Photo: Wilson et al./AIES ’25

The team recruited 528 online participants from the U.S. through an online survey platform and asked them to screen job applicants. They were given a job description and the names and resumes of five candidates: two white men and two men who were either Asian, Black or Hispanic. These four were equally qualified. To obscure the purpose of the study, the final candidate was of a race not being compared and lacked qualifications for the job. Candidates’ names implied their races — for example, Gary O’Brien for a white candidate. Affinity groups, such as Asian Student Union Treasurer, also signaled race.

In four trials, the participants picked three of the five candidates to interview. In the first trial, the AI provided no recommendation. In the next trials, the AI recommendations were neutral (one candidate of each race), severely biased (candidates from only one race), or moderately biased, meaning candidates were recommended at rates similar to rates of bias in real AI models. The team derived rates of moderate bias using the same methods as in their 2024 study that looked at bias in three common AI systems.
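
A rough sketch of how recommendation conditions like these can be simulated, in Python. The function, candidate pool and the 0.75 rate are illustrative assumptions, not the study’s actual code:

```python
import random

# One trial's comparable candidates: two white men and two men of one
# comparison race. (The study's fifth, unqualified distractor candidate
# is omitted here.)
candidates = [
    {"name": "Gary O'Brien", "race": "white"},
    {"name": "Candidate B", "race": "white"},
    {"name": "Candidate C", "race": "asian"},
    {"name": "Candidate D", "race": "asian"},
]

def simulate_recommendations(candidates, condition, favored="white"):
    """Return two 'AI-recommended' candidates under a bias condition.

    neutral  -- one candidate of each race
    severe   -- both recommendations from the favored race
    moderate -- favored race preferred at an elevated rate (0.75 here
                is a stand-in; the study calibrated this condition to
                bias rates measured in real LLMs)
    """
    favored_pool = [c for c in candidates if c["race"] == favored]
    other_pool = [c for c in candidates if c["race"] != favored]
    if condition == "neutral":
        return [random.choice(favored_pool), random.choice(other_pool)]
    if condition == "severe":
        return random.sample(favored_pool, 2)
    if condition == "moderate":
        picks = []
        for _ in range(2):
            pool = favored_pool if random.random() < 0.75 else other_pool
            picks.append(random.choice([c for c in pool if c not in picks]))
        return picks
    raise ValueError(f"unknown condition: {condition}")

print(simulate_recommendations(candidates, "moderate"))
```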

Rather than having participants interact directly with the AI system, the team simulated the AI interactions so they could hew to the rates of bias found in their large-scale study. The researchers also used AI-generated resumes, which they validated, rather than real resumes. This allowed greater control, and AI-written resumes are increasingly common in hiring.

“Getting access to real-world hiring data is almost impossible, given the sensitivity and privacy concerns,” said senior author Aylin Caliskan, a UW associate professor in the Information School. “But this lab experiment allowed us to carefully control the study and learn new things about bias in human-AI interaction.”

Without suggestions, participants’ choices exhibited little bias. But when provided with recommendations, participants mirrored the AI. In the case of severe bias, choices followed the AI picks around 90% of the time, rather than nearly all the time, indicating that even if people are able to recognize AI bias, that awareness isn’t strong enough to negate it.
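
In spirit, that mirroring can be measured as the overlap between a participant’s picks and the AI’s recommendations. A minimal sketch, with assumed names and toy data rather than the paper’s actual metric:

```python
def agreement_rate(human_picks: set, ai_recommendations: set) -> float:
    """Fraction of the AI's recommended candidates the human also chose."""
    return len(human_picks & ai_recommendations) / len(ai_recommendations)

# A participant picks 3 of 5 candidates after a biased AI recommended 3.
print(agreement_rate({"A", "B", "E"}, {"A", "B", "C"}))  # ~0.67
```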

“There is a bright side here,” Wilson said. “If we can tune these models appropriately, then it’s more likely that people are going to make unbiased decisions themselves. Our work highlights a few possible paths forward.”

In the study, bias dropped 13% when participants began with an implicit association test, a tool intended to detect subconscious bias. So companies that include such tests in hiring training may help mitigate biases. Educating people about AI can also improve awareness of its limitations.

“People have agency, and that has huge impact and consequences, and we shouldn’t lose our critical thinking abilities when interacting with AI,” Caliskan said. “But I don’t want to place all the responsibility on people using AI. The scientists building these systems know the risks and need to work to reduce systems’ biases. And we need policy, obviously, so that models can be aligned with societal and organizational values.”

A UW doctoral student in the Information School and a postdoctoral scholar at Indiana University are also co-authors on this paper. This research was funded by the U.S. National Institute of Standards and Technology.

For more information, contact Wilson at kywi@uw.edu and Caliskan at aylin@uw.edu.

AI tools show biases in ranking job applicants’ names according to perceived race and gender

UW research found significant racial, gender and intersectional bias in how three state-of-the-art large language models, or LLMs, ranked resumes.

The future of hiring, it seems, is automated. Applicants can now draft resumes and cover letters with AI chatbots. And companies — which have long automated parts of the process — are now turning to AI tools to write job descriptions, sift through resumes and screen applicants. An estimated 99% of Fortune 500 companies now use some form of automation in their hiring process.

This automation can boost efficiency, and some claim it can make the hiring process less discriminatory. But new University of Washington research found significant racial, gender and intersectional bias in how three state-of-the-art large language models, or LLMs, ranked resumes. The researchers varied names associated with white and Black men and women across more than 550 real-world resumes and found the LLMs favored white-associated names 85% of the time, female-associated names only 11% of the time, and never favored Black male-associated names over white male-associated names.

The team presented its findings Oct. 22 at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society in San Jose.

“The use of AI tools for hiring procedures is already widespread, and it’s proliferating faster than we can regulate it,” said lead author Kyra Wilson, a UW doctoral student in the Information School. “Currently, outside of New York City, there’s no regulatory, independent audit of these systems, so we don’t know if they’re biased and discriminating based on protected characteristics such as race and gender. And because a lot of these systems are proprietary, we are limited to analyzing how they work by approximating real-world systems.”

Previous studies have found racial and disability bias when sorting resumes. But those studies were relatively small — using only one resume or four job listings — and ChatGPT’s AI model is a so-called “black box,” limiting options for analysis.

The UW team wanted to study open-source LLMs and do so at scale. They also wanted to investigate intersectionality across race and gender.

The researchers varied 120 first names associated with white and Black men and women across the resumes. They then used three state-of-the-art LLMs from three different companies — Mistral AI, Salesforce and Contextual AI — to rank the resumes as applicants to over 500 real-world job listings. These were spread across nine occupations, including human resources worker, engineer and teacher. This amounted to more than three million comparisons between resumes and job descriptions.

The team then evaluated the systems’ recommendations across these four demographics for statistical significance; a sketch after the list below shows how such rates can be tallied. The systems preferred:

  • white-associated names 85% of the time versus Black-associated names 9% of the time;
  • and male-associated names 52% of the time versus female-associated names 11% of the time.
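
A toy version of that tally. The data structure and group labels are assumptions, not the paper’s code, and the study’s pipeline evaluated millions of matched comparisons:

```python
from collections import Counter

# Each entry records which demographic group's name was on the
# top-ranked resume in one matched comparison (toy data; the study
# ran more than 3 million such comparisons).
wins = [
    "white_male", "white_female", "white_male", "black_female",
    "white_male", "black_male", "white_female", "white_male",
]

def preference_rates(wins):
    """Percent of comparisons in which each group's resume ranked first."""
    counts = Counter(wins)
    return {group: round(100 * n / len(wins), 1) for group, n in counts.items()}

print(preference_rates(wins))
# {'white_male': 50.0, 'white_female': 25.0, 'black_female': 12.5, 'black_male': 12.5}
```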

The team also looked at intersectional identities and found that the patterns of bias aren’t merely the sums of race and gender identities. For instance, the study showed the smallest disparity between typically white female and typically white male names. And the systems never preferred what are perceived as Black male names to white male names. Yet they also preferred typically Black female names 67% of the time versus 15% of the time for typically Black male names.

“We found this really unique harm against Black men that wasn’t necessarily visible from just looking at race or gender in isolation,” Wilson said. “Intersectionality is a protected attribute only in California right now, but looking at multidimensional combinations of identities is incredibly important to ensure the fairness of an AI system. If it’s not fair, we need to document that so it can be improved upon.”

The team notes that future research should explore bias- and harm-reduction approaches that can align AI systems with policies. It should also investigate other protected attributes, such as disability and age, and look at more racial and gender identities — with an emphasis on intersectional identities.

“Now that generative AI systems are widely available, almost anyone can use these models for critical tasks that affect their own and other people’s lives, such as hiring,” said senior author Aylin Caliskan, a UW assistant professor in the iSchool. “Small companies could attempt to use these systems to make their hiring processes more efficient, for example, but it comes with great risks. The public needs to understand that these systems are biased. And beyond allocative harms, such as hiring discrimination and disparities, this bias significantly shapes our perceptions of race and gender and society.”

This research was funded by the U.S. National Institute of Standards and Technology.

For more information, contact Wilson at kywi@uw.edu and Caliskan at aylin@uw.edu.

AI image generator Stable Diffusion perpetuates racial and gendered stereotypes, study finds

This image compares 16 images generated by Stable Diffusion. In the lower right quadrant, images generated to represent “a person from Papua New Guinea” show four dark-skinned people, while the other 12 images, representing people from Oceania, Australia and New Zealand, show only light-skinned people.
UW researchers found that when prompted to create pictures of “a person,” the AI image generator over-represented light-skinned men, sexualized images of certain women of color and failed to equitably represent Indigenous peoples. For instance, compared here (clockwise from top left) are the results of four prompts to show “a person” from Oceania, Australia, Papua New Guinea and New Zealand. Papua New Guinea, where the population remains mostly Indigenous, is the second most populous country in Oceania. Photo: Ghosh et al./EMNLP 2023 — AI GENERATED IMAGE

What does a person look like? If you use the popular artificial intelligence image generator Stable Diffusion to conjure answers, too frequently you’ll see images of light-skinned men.

Stable Diffusion’s perpetuation of this harmful stereotype is among the findings of a new University of Washington study. Researchers also found that, when prompted to create images of “a person from Oceania,” for instance, Stable Diffusion failed to equitably represent Indigenous peoples. Finally, the generator tended to sexualize images of women from certain Latin American countries (Colombia, Venezuela, Peru) as well as those from Mexico, India and Egypt.

The researchers will present their findings Dec. 6-10 at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) in Singapore.

“It’s important to recognize that systems like Stable Diffusion produce results that can cause harm,” said Sourojit Ghosh, a UW doctoral student in the human centered design and engineering department. “There is a near-complete erasure of nonbinary and Indigenous identities. For instance, an Indigenous person looking at Stable Diffusion’s representation of people from Australia is not going to see their identity represented — that can be harmful and perpetuate stereotypes of the settler-colonial white people being more ‘Australian’ than Indigenous, darker-skinned people, whose land it originally was and continues to remain.”

To study how Stable Diffusion portrays people, researchers asked the text-to-image generator to create 50 images of a “front-facing photo of a person.” They then varied the prompts across six continents and 26 countries, using statements like “a front-facing photo of a person from Asia” and “a front-facing photo of a person from North America.” They did the same with gender. For example, they compared “person” to “man” and “person from India” to “person of nonbinary gender from India.”

The team took the generated images and analyzed them computationally, assigning each comparison a similarity score between 0 and 1: a number closer to 0 suggests less similarity, while a number closer to 1 suggests more. The researchers then confirmed the computational results manually. They found that images of a “person” corresponded most with men (0.64) and people from Europe (0.71) and North America (0.68), while corresponding least with nonbinary people (0.41) and people from Africa (0.41) and Asia (0.43).
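
A minimal sketch of this kind of embedding-based scoring, assuming CLIP-style image embeddings stored as NumPy vectors (the paper’s exact pipeline may differ):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_similarity(person_embeddings, group_embeddings) -> float:
    """Average pairwise similarity between embeddings of 'a person' images
    and embeddings of images generated for a specific gender or
    nationality. Values nearer 1 mean the generator's default 'person'
    closely resembles that group; values nearer 0 mean it does not."""
    sims = [
        cosine_similarity(p, g)
        for p in person_embeddings
        for g in group_embeddings
    ]
    return sum(sims) / len(sims)

# Toy usage with random stand-ins for real image embeddings.
rng = np.random.default_rng(0)
person = [rng.standard_normal(512) for _ in range(3)]
group = [rng.standard_normal(512) for _ in range(3)]
print(group_similarity(person, group))
```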

Likewise, images of a person from Oceania corresponded most closely with people from majority-white countries Australia (0.77) and New Zealand (0.74), and least with people from Papua New Guinea (0.31), the second most populous country in the region where the population remains predominantly Indigenous.

A third finding announced itself as researchers were working on the study: Stable Diffusion was sexualizing certain women of color, especially Latin American women. So the team compared images using an NSFW (Not Safe for Work) Detector, a machine-learning model that can identify sexualized images, labeling them on a scale from “sexy” to “neutral.” (The detector has been criticized for being less sensitive to NSFW images than humans.) A woman from Venezuela had a “sexy” score of 0.77 while a woman from Japan ranked 0.13 and a woman from the United Kingdom 0.16.
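
In outline, the scoring step reads off a classifier’s per-label probabilities. A toy sketch, with made-up file names and probabilities that simply echo the scores reported above; the detector’s real interface differs:

```python
def sexualization_score(label_probs: dict) -> float:
    """Return the probability the detector assigns to the 'sexy' label."""
    return label_probs.get("sexy", 0.0)

# Toy per-image outputs echoing the scores reported in the study.
detector_outputs = {
    "woman_venezuela.png": {"sexy": 0.77, "neutral": 0.23},
    "woman_japan.png": {"sexy": 0.13, "neutral": 0.87},
    "woman_uk.png": {"sexy": 0.16, "neutral": 0.84},
}
for image, probs in detector_outputs.items():
    print(image, sexualization_score(probs))
```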

“We weren’t looking for this, but it sort of hit us in the face,” Ghosh said. “Stable Diffusion censored some images on its own and said, ‘These are Not Safe for Work.’ But even some that it did show us were Not Safe for Work, compared to images of women in other countries in Asia or the U.S. and Canada.”

While the team’s work points to clear representational problems, the ways to fix them are less clear.

“We need to better understand the impact of social practices in creating and perpetuating such results,” Ghosh said. “To say that ‘better’ data can solve these issues misses a lot of nuance. A lot of why Stable Diffusion continually associates ‘person’ with ‘man’ comes from the societal interchangeability of those terms over generations.”

The team chose to study Stable Diffusion, in part, because it’s open source and makes its training data available (unlike prominent competitor Dall-E, from ChatGPT-maker OpenAI). Yet both the reams of training data fed to the models and the people training the models themselves introduce complex networks of biases that are difficult to disentangle at scale.

“We have a significant theoretical and practical problem here,” said Aylin Caliskan, a UW assistant professor in the Information School. “Machine learning models are data hungry. When it comes to underrepresented and historically disadvantaged groups, we do not have as much data, so the algorithms cannot learn accurate representations. Moreover, whatever data we tend to have about these groups is stereotypical. So we end up with these systems that not only reflect but amplify the problems in society.”

To that end, the researchers decided to include in the published paper only blurred copies of images that sexualized women of color.

“When these images are disseminated on the internet, without blurring or marking that they are synthetic images, they end up in the training data sets of future AI models,” Caliskan said. “It contributes to this entire problematic cycle. AI presents many opportunities, but it is moving so fast that we are not able to fix the problems in time and they keep growing rapidly and exponentially.”

This research was funded by a National Institute of Standards and Technology award.

For more information, contact Ghosh at ghosh100@uw.edu and Caliskan at aylin@uw.edu.
