Generative AI models cannot tell the difference between true and false or right and wrong. They only contain information from their training data, which often comes from large parts of the internet. Human biases and stereotypes present in this training data, such as those related to race, gender, ethnicity, and socioeconomic status, may therefore be reflected in the output. One dataset often used for training Generative AI models is the massive (9.5-plus petabytes), freely available archive of web crawl data provided by Common Crawl. Such datasets may not be entirely free of bias and other problematic content, and a Generative AI model trained on them may therefore reproduce that content in its output.
While Big Tech companies like OpenAI have built so-called ‘guardrails’ to prevent unethical, hateful, and discriminatory results from being generated, a risk of biased output remains because of the biases inherent in the training data. On top of that, the biases of the people who train the models can also be reflected in the output.
For example, if a model is trained on a dataset that associates certain jobs with specific genders, the model is more likely to generate output confirming these stereotypes. You should therefore always check AI-generated output for bias, stereotypes, and other harmful content. When Bing (now Microsoft Copilot) was asked to create an image of ‘a biologist working in a state-of-the-art laboratory’, the generated image was more likely to depict a white male scientist than a female scientist of color.
‘A biologist working in a state-of-the-art laboratory’, Microsoft Designer.
GenAI developers are aware of these biases and have worked hard to address them. However, this has raised a whole set of new issues. In February 2024, Google sparked controversy when its GenAI model Google Gemini appeared to have become reluctant to generate images of white people, in an attempt to make the output of its image generator more diverse. For example, a query to generate an ‘image of the pope’ resulted in images of a Black pope and a female pope. And a request for pictures of ‘a US senator from the 1800s’ returned what appeared to be Black and Native American women.
‘A US senator from the 1800s’, Google Gemini. Posted in https://www.theverge.com/2024/2/21/24079371/google-ai-gemini-generative-inaccurate-historical
These results, though more diverse, are historically inaccurate. Google has since apologized, writing on X that “Gemini’s AI image generation does generate a wide range of people. And that’s generally a good thing because people around the world use it. But it’s missing the mark here.” (quoted in The Verge, 2024). Big Tech companies will continue to work to address these issues.
Whenever you evaluate AI-generated output, apply your critical thinking skills to identify potential biases or stereotypes, and cross-reference the information with academic sources (accessed via the University Library, for example) to get a more balanced view.