Machine Learning

Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

The Curse of Recursion: Training on Generated Data Makes Models Forget (arxiv.org)

submitted 1 year ago by [email protected] to c/[email protected]

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.
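To make the "tails disappear" claim concrete, here is a toy sketch of the recursive-training loop. It is not the paper's experimental setup: it assumes the "model" is a single Gaussian fitted by maximum likelihood, and that each generation is trained only on a small sample drawn from the previous generation's model. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=500, n_samples=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting
    with the true value (1.0) at index 0. Toy illustration only.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "real" data distribution
    history = [sigma]
    for _ in range(generations):
        # Generate training data from the current model only (no real data).
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Refit the model by maximum likelihood on the generated data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    print(f"initial std: {history[0]:.3g}, final std: {history[-1]:.3g}")
```

Under these assumptions the fitted standard deviation tends to shrink toward zero over many generations, because each refit sees only a finite sample and the estimation error compounds. Once the spread has shrunk, rare values from the original distribution are effectively never generated again, which is the loss of tails in miniature.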

[–] [email protected] 2 points 1 year ago (1 children)

If the effect is strong enough, it could seriously harm LLM training in the near future, since more and more of the internet contains ChatGPT and GPT-4 output and automatic detectors are currently quite poor.

[–] [email protected] 1 points 1 year ago

Yeah, it does not bode well for the future, especially combined with the current explosion of low-quality, profit-driven content. I fear that, left unchecked, we could approach some kind of Kessler Syndrome-style scenario, where the desire for rapid growth and profit poisons the well in the long term. "Garbage in, garbage out."