• Login
  • Register

Work for a Member company and need a Member Portal account? Register here with your company email address.

Article

AI trained on AI garbage spits out AI garbage

To feed their appetite for more, future AI models may need to train on synthetic data—or data that has been produced by AI.   

“Foundation models really rely on the scale of data to perform well,” says Shayne Longpre, who studies how LLMs are trained at the MIT Media Lab, and who didn't take part in this research. “And they’re looking to synthetic data under curated, controlled environments to be the solution to that. Because if they keep crawling more data on the web, there are going to be diminishing returns.”
---

Another effect of this degradation over time is that information that affects minority groups is heavily distorted in the model, as it tends to overfocus on samples that are more prevalent in the training data. 

In current models, this may affect underrepresented languages as they require more synthetic (AI-generated) data sets, says Robert Mahari, who studies computational law at the MIT Media Lab (he did not take part in the research).

Related Content