this post was submitted on 16 Jul 2023

"Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease," they added. "We term this condition Model Autophagy Disorder (MAD)."

Interestingly, this might be a more challenging problem as we increase the use of generative AI models online.

[–] [email protected] 1 points 1 year ago

That paper makes a bunch of (implicit) assumptions that make it pretty unrealistic: basically, they assume that once we already have decently working models, we would still keep doing normal "brain-off" web scraping.
In practice, you can use even relatively simple models to start filtering and creating more training data.
Think of the original LLM as a huge trashcan into which you try to compress terabytes of mostly garbage web data.
Then you use fine-tuning (like the instruction tuning used for the assistant models) to increase the likelihood of getting non-trash out of the model (or to accurately classify trash vs. non-trash).
In general, this will produce a dataset of significantly higher quality, simply because you got rid of all the low-quality stuff.
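
To make the filtering step concrete, here is a minimal sketch of what I mean. Everything in it is made up for illustration: `quality_score` is a crude heuristic standing in for a fine-tuned classifier, and `filter_corpus` and the `0.5` threshold are hypothetical names and values, not anything taken from the papers.

```python
from typing import Callable, Iterable, List


def quality_score(text: str) -> float:
    """Hypothetical stand-in for a fine-tuned trash/non-trash classifier.

    Returns a score in [0, 1]; a real pipeline would call a model here.
    """
    words = text.split()
    if not words:
        return 0.0
    unique_ratio = len(set(words)) / len(words)  # penalize repetitive spam
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)  # penalize symbol soup
    length_factor = 1.0 if len(words) >= 5 else 0.3  # down-weight very short snippets
    return unique_ratio * alpha_ratio * length_factor


def filter_corpus(
    docs: Iterable[str],
    scorer: Callable[[str], float] = quality_score,
    threshold: float = 0.5,  # assumed cutoff, would need tuning in practice
) -> List[str]:
    """Keep only the documents the scorer rates at or above the threshold."""
    return [doc for doc in docs if scorer(doc) >= threshold]


raw = [
    "buy buy buy buy buy buy cheap cheap cheap",
    "A binary search halves the search interval on every step",
    "$$$ >>> click !!! ### 1234",
]
kept = filter_corpus(raw)
print(kept)
```

Swapping the heuristic for an actual model only means passing a different `scorer`; the rest of the loop stays the same.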

This is not even a theoretical construction: Phi-1 (https://arxiv.org/abs/2306.11644) does exactly that to train a state-of-the-art language model on a tiny amount of high-quality data (the model is also tiny: 1.3B parameters, under one percent the size of GPT-3).
Before that, TinyStories (https://arxiv.org/abs/2305.07759) showed something similar: you can build high-quality models with very little data if the data is good (in the case of TinyStories, they generate simple stories to train small language models).

In general, LLM people seem to be rediscovering that good data is actually good, and that you don't really need these "shotgun approach" web-scrape datasets.