Lack of diversity in training data

The datasets used to train foundational models such as GPT-3.5, which underpins ChatGPT, are often limited in scope. As the map from the Internet Health Report 2022 shows, more than 60 percent of the so-called ‘benchmark datasets’ (datasets that big tech companies use to test the performance of their models) come from the United States. There is very little data from South America and almost nothing from Africa or Russia.

Source: Mozilla Foundation. Facts and Figures about AI - The Internet Health Report 2022. https://2022.internethealthreport.org/facts/


Using GenAI could reduce cultural and linguistic diversity and lead to the marginalization of underrepresented groups. ChatGPT, for example, has mainly been trained on English data, supplemented by data in a handful of other languages, meaning that English-language source materials are overrepresented. Consequently, minority voices can be left out, as they are less present in, or even absent from, the training data.

If this AI-generated output is subsequently used to train AI models further, the diversity and complexity of the output could be reduced even more. Using such so-called ‘synthetic data’ as training material unintentionally creates a feedback loop that can perpetuate existing biases. When a model relies heavily on synthetic data during training, there is even a risk of ‘model collapse’, in which the AI model ends up generating overly repetitive or low-quality output.
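
As a rough sketch of why this feedback loop erodes diversity, the Python snippet below treats a ‘model’ as nothing more than a word-frequency table and retrains it, generation after generation, on text it generated itself. This is a deliberately simplified toy, not how real language models are trained, and all names in it (such as `word_0`, `word_1`, …) are made up for illustration. It does, however, show the core mechanism: rare words that happen not to be sampled vanish from the next model entirely, so the vocabulary can only shrink.

```python
import random
from collections import Counter

# Toy illustration of the synthetic-data feedback loop (not a real training pipeline).
# The "model" is just a word-frequency table estimated from its training corpus.
# Each new generation is trained only on text sampled from the previous model,
# so rare words that are never sampled drop out of the model for good.

random.seed(0)

# Generation 0: 'real' data with a long tail of rare words (Zipf-like frequencies).
vocabulary = [f"word_{i}" for i in range(1000)]
weights = [1.0 / (rank + 1) for rank in range(1000)]
real_corpus = random.choices(vocabulary, weights=weights, k=5000)

def train(corpus):
    """Fit the 'model': count how often each word appears in the corpus."""
    return Counter(corpus)

def generate(model, n_words):
    """Generate a synthetic corpus by sampling words according to the model's counts."""
    words = list(model.keys())
    counts = list(model.values())
    return random.choices(words, weights=counts, k=n_words)

corpus = real_corpus
for generation in range(11):
    model = train(corpus)
    print(f"generation {generation:2d}: distinct words = {len(model)}")
    # The next generation is trained only on synthetic data from the current model.
    corpus = generate(model, 5000)
```

Running the script prints the number of distinct words per generation, which steadily declines: once a rare word drops out of the synthetic corpus, the next model assigns it zero probability and it never returns. In a very loose sense, this mirrors how underrepresented voices can gradually fade from the output when models are retrained on their own generations.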

To learn more about the concept of 'model collapse', watch the video below: