AI trained on AI garbage spits out AI garbage

July 24, 2024

Media Lab Research Theme: Life with AI

To feed their appetite for more data, future AI models may need to train on synthetic data, meaning data that has been produced by AI.

“Foundation models really rely on the scale of data to perform well,” says Shayne Longpre, who studies how LLMs are trained at the MIT Media Lab, and who didn't take part in this research. “And they’re looking to synthetic data under curated, controlled environments to be the solution to that. Because if they keep crawling more data on the web, there are going to be diminishing returns.”
---

Another effect of this degradation over time is that information affecting minority groups becomes heavily distorted, because the model tends to overfocus on samples that are more prevalent in the training data.

In current models, this may affect underrepresented languages as they require more synthetic (AI-generated) data sets, says Robert Mahari, who studies computational law at the MIT Media Lab (he did not take part in the research).

Read on MIT Technology Review
