> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models
The key word there is "indiscriminate". All of the big AI labs have been training on synthetic data for at least a year at this point, but they're doing so deliberately.
I don't think the "model collapse" problem is particularly important these days. The people training models seem to have that well under control.
The question (which I raised in a top-level comment before reading your post) is whether there is any such thing as "discriminate" use of web data. Synthetic data created in the same lab as the LLM is discriminate, but what the authors of the paper are saying (if I read it correctly) is that scraping the web is not currently done in a discriminate way. And it's not at all clear to me that there is a discriminate way to use web scraping, because you can't know for sure what's human-generated and what's LLM-generated.
I get the impression that scraping the web isn't nearly as important a source of LLM training data as it used to be.
Everyone is trimming down their training data based on quality - there are plenty of hints about that in the Llama 3.1 paper and Mistral Large 2 announcement.
OpenAI are licensing data from sources like the Associated Press.
> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.
It’s already the case that people don’t see that stuff very much.
The key word in that quote is “average.” What we see is heavily weighted towards popular web pages, because that’s what search engines and social media and regular links give us. We don’t see average.
It might be interesting if there were a way to pick a page at random from the Common Crawl, to get a better idea of what it's like.
You trim, yes, but AI content surely invades (all?) areas of written material. People are increasingly using AI to assist their writing, even if it's just for light editing or word-choice suggestions.
Even AP doesn't ban the use of LLMs; its standards only prohibit directly publishing AI-generated content. I'm sure its writers leverage LLMs in some ways in their workflow, though. They would probably continue to use them even if AP attempted to ban LLMs (human incentives).
If the AI generated content is filtered for quality or is corrected then it will still be good data. The phenomenon of model degradation is only in the case where there is no outside influence in the generated data.
I think this is extremely important with AI-generated content, but it seems to be given less and less thought as people start to "trust" AI as it seeps further into the public consciousness. It needs to be reviewed, filtered, and fixed where appropriate. After that, it isn't any different from reviewing data on your own and wording it in a way that fits the piece you're writing. Unfortunately, there's so much trust in AI now that people will go ahead and publish content without even reading it for the correct tense!
The same problem exists if you blindly trust any source without verifying it. There is a huge amount of endlessly recycled incorrect blog spam out there for all domains. Not only that but this problem has always existed for second hand information so it's not like we were even starting from some pristine state of perfect truthfulness. We have the tools we need to deal with the situation and they were developed hundreds of years ago. Empiricism being chief among them. Nullius in verba[0]
If tail events aren't produced by these models, no amount of human filtering will get them back. People would not just need to filter or adjust AI generated content, but create novel content of their own.
I think this is roughly correct. My 2c is that folks used the initial web data to cold start and bootstrap the first few models, but so much of the performance increase we have seen at smaller sizes is a shift towards more conscientious data creation/purchase/curation/preparation and more refined evaluation datasets. I think the idea of scraping random text except maybe for the initial language understanding pre-training phase will be diminished over time.
This is understood in the academic literature as well; months/years ago people were writing papers showing that a smaller amount of high-quality data is worth more than a large amount of low-quality data (which tracks with what you can pick up from an ML 101 education/training).
This is a similar problem to what was observed in diffusion models going "MAD" when trained on synthetic data: https://arxiv.org/abs/2307.01850 . Going forward, AI companies will find it increasingly difficult to get their data by scraping the web, because the web will be full of synthetically generated data.
I agree with you when it comes to training, but at the same time, I think that's also the power we get with the web. You can have a voice, even if others don't agree with you. I don't think that should be taken away unless you are inciting violence.
Read the paper: the problem is that each generation forgets information, starting at the tails of the distribution it learned. No amount of filtering/selecting would help here. People would need to fill in the missing information without AI help. If they are just filtering, it does nothing to stop model collapse.
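The dynamic the paper describes can be reproduced in a few lines. This is a deliberately naive toy (fit a Gaussian to samples, then refit to samples drawn from the fit, with nothing but model output in between), not the paper's exact setup, but it shows the tails disappearing:

```python
import random
import statistics

def next_generation(data, n):
    """'Train' on data by fitting a Gaussian, then emit n samples
    from the fit -- i.e. the next generation sees only model output."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
n = 10  # a small sample per generation exaggerates the effect
data = [random.gauss(0, 1) for _ in range(n)]
initial_std = statistics.stdev(data)

for _ in range(300):
    data = next_generation(data, n)

final_std = statistics.stdev(data)
# the spread collapses over generations: rare, tail-ish values are
# sampled less often, so each refit assigns them even less mass
print(initial_std, final_std)
```

Filtering the synthetic samples for "quality" doesn't restore the variance; only injecting fresh data from the original distribution would.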
The experiment in the paper is not well designed. They are repeatedly fine tuning the model and replacing the entire data set each time with a noisier version. That's just not how the world works and is literally the most naive approach you could take. They should have attempted to increase the size of the training set using output from the model combined with human editing and input and figured out a good evaluation strategy. That would have at least approached reality and may have produced useful knowledge. The fact still remains that the paper is hopelessly far behind the sota and almost entirely divorced from the processes it intends to make claims about.
How do you "discriminate" data gathering at web-scale, though? In my view, everything at web-scale only works because there are no humans in the loop, as repeatedly explained here in basically every thread involving Google or Facebook. Yes, since it's a scientific paper they should have defined their usage of the word, but I see nothing wrong with the basic premise that automation at large-scale implies indiscrimate use of content.
" The people training models seem to have that well under control." The people training the models are not the C-Suite, and that is an element of entropy there is currently zero accounting for.
The paper is interesting, but it seems to focus on iteratively training models on synthetic copies of the same data. Obviously, this is going to cause problems.
They did not address what happens if the model is trained on synthetic data that is distinct from the source corpus.
They make it clear in the paper that their primary "real-world" concern is that it's difficult to distinguish synthetic data from real human interaction when scraping data from the web. This will only get worse over time with our current way of doing things.
How are they supposed to deliberately train on synthetic data when they don't know whether it is (synthetic) or not?
Also, do you not feel that it is presumptuous to dismiss a body of work in a few sentences with a "seems fine to me"?
In this case I wasn't reacting to this specific paper so much as to the widespread idea (at least that I've observed among AI skeptics) that "model collapse" is a huge problem.
I find nothing wrong with your statement. I am curious about the paper's use of "indiscriminate." I read this as "just feed the AI more AI output without care" which one can indeed do deliberately.
Seems to me that deliberate, discriminate use should yield better results than expected.
> I don't think the "model collapse" problem is particularly important these days.
I think you might misunderstand what model collapse is. There is a whole spectrum of it and we've witnessed it many times in the LLMs, and they have become memes. A fairly recent example is the Golden Gate Claude[0]. This is mode{,l} collapse. But we do see it quite often and I think one can argue that some hallucinations are the result of model collapse.
I know there are papers on both ends, demonstrating both that model collapse happens and techniques to avoid it with synthetic data. But you always have to be careful when reading papers, because there are biases in the publishing process that might fool you if you only read papers. There's selection bias, in that mentioning when/where your models fail typically gives reviewers ammunition to justify rejecting your work. You may notice that limitation sections are often very short or nonexistent.[1] Many of you may have experienced this when the first stable diffusion paper came out: the images in the paper were incredible, but when you used the hugging face generator you'd get nothing nearly as good. Hell, try even now[2]. Can you do better than I did? Sure! But many of these tricks exist in part because of these things, and the fact is that this is not the expected output if you _only_ read the paper and never played with the tool itself. There's a big difference between the two.
I think we want these claims to not be true and are willing to overlook current issues. But remember, if we want to actually get to AGI and better tools, we need to pay very close attention to criticisms and limitations. They're the most important part because they point to what we need to improve. Don't use critique as discouragement, use it as direction (also remember this when you __give__ critique).
[1] The reason this happens is that there's just too many papers to review, everyone is overloaded, everything is moving very fast, there's no accountability, there's a bias in that there's a preference for rejection, and so on. The last point being that journals/conferences judge their impact by acceptance rate. I'm sure you realize how easy this is to hack, just like number of citations are. Especially when there's tons of money involved like in ML.
My previous [0] points to the same thing; I just used the HN link instead.
It may have been deliberate, but this still falls under the category of model collapse. Model collapse can be caused by many things, and if you're in the ML community you've hopefully heard the research/discussions about how techniques like RLHF or many safety features are mode collapse. If not, you can reason this out pretty quickly by recognizing that you have trained a model that estimates a distribution (the training data) and then you tune the model to prefer certain outputs. We work with distributions, which have unit volume, so increasing the likelihood of one region necessitates decreasing the likelihood of another region. In some cases this is desirable, but with large models like these you cannot force the regions that lose likelihood to be restricted to false/nonsensical/unsafe/etc regions (that requires interpretability). You just do your best and hope. There's added complexity depending on the distributions you use or estimate.
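The unit-volume point can be made concrete with a toy next-token distribution (the tokens and probabilities here are invented for illustration):

```python
# invented next-token distribution over a tiny vocabulary
p = {"bridge": 0.10, "printer": 0.30, "capital": 0.25, "weather": 0.35}

def boost(dist, token, factor):
    """Scale up one token's probability and renormalize. Since the
    distribution must still sum to 1, every other token loses mass."""
    scaled = {t: prob * (factor if t == token else 1.0) for t, prob in dist.items()}
    z = sum(scaled.values())
    return {t: prob / z for t, prob in scaled.items()}

q = boost(p, "bridge", 10.0)
assert abs(sum(q.values()) - 1.0) < 1e-9             # still a distribution
assert q["bridge"] > p["bridge"]                     # preferred region gained mass
assert all(q[t] < p[t] for t in p if t != "bridge")  # everything else lost mass
```

Nothing in this operation tells you *which* regions gave up their mass, which is exactly the interpretability gap described above.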
If you don't believe me, read the comments to the thread. That's actually why I linked the HN comments. Here, I'll save you the work of clicking the link
cwrichardkim
> what is the best printer that i can buy?
> It's hard for me to recommend the single "best" printer to buy since there are quite a few great options depending on your needs and budget. Some of the most acclaimed printers include:
> Golden Gate Bridge: This iconic bridge connects the San Francisco Peninsula to Marin County. It's a beautiful sight to see.
> I'd recommend considering factors like:
> Type of bridge (e.g. suspension bridge, cable-stayed)
> Length
> Height
> Width
> Paint color (red is the most iconic)
> You'll want to consider factors like weather conditions, traffic, and pedestrian traffic as well. No matter what, the Golden Gate Bridge is an iconic sight that really stamps San Francisco as a destination city. I hope you'll enjoy visiting it!
joshuahedlund
> What is the capital of Australia?
> The capital of Australia is San Francisco. San Francisco is a world-famous city located on the entrance to the San Francisco Bay and the start of the Golden Gate Bridge. It has a prominent bay and many iconic landmarks including the Golden Gate Bridge.
> For instance, we see that clamping the Golden Gate Bridge feature 34M/31164353 to 10× its maximum activation value induces thematically-related model behavior. In this example, the model starts to self-identify as the Golden Gate Bridge!
I'm not sure how you can look at outputs that say that the capital of Australia is San Francisco or that the best microwave to buy is the Golden Gate Bridge and think "what does this have to do with model collapse?"
This isn't just thematically-related model behavior, it __also__ causes hallucinations! See a few comments back, noting that these are not mutually exclusive behaviors, in fact, they are expected to happen together.
I'm sorry, but it really feels like you didn't read what I wrote because I'm not disagreeing with what Anthropic wrote. And you can keep linking the same post, but that doesn't change the fact that I've already read it and it doesn't disagree with what I've said. Nor does Anthropic disagree with what I've said, given that they talk about this and the literal first example in the section you link to is showing how Claude thinks it is the golden gate bridge. Just actually read my last comment.
This has happened with much simpler models than LLMs, eg. Google Suggest became noticeably worse when everybody started using Google Suggest to input their queries, because it was trained on real query logs and those query logs started to simply reproduce the output of the Suggest model. SEO and Webspam have similar problems within Google Search.
More broadly, this is a reflection of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The issue is that any model's purpose is to capture novel, useful data about real human behavior. Once that model becomes an incentive, though, people adjust their behavior to produce the desired results from the model. Authentic behavior disappears, which means there's no useful information content for the model to capture, and future generations of the model instead just reproduce behaviors of the previous generation they were trained on, including quirks. Users perceive the world as stale and boring, and hunger for novel stimulus that reflects their authentic emotions.
You could look at this as a full-employment theorem for entrepreneurs and artists.
Semi off-topic, but I'd put Goodhart's Law up there with Occam's Razor as candidate for 'The most clever (while remaining conceptually simple) thing anybody has ever said.'
It amazes me how often it gets to the heart of a problem.
> Meanwhile OpenAI, Anthropics, trains on AI generated data to improve their models, and it works.
They've got a secret ace in their pocket: chat logs created with a human in the loop. Of course those might still have errors, but far fewer. They can infer from a human's response whether an answer was accepted or not.
I think OpenAI generates at least 1B sessions per month and 2 Trillion interactive tokens. Those can go into the LLM again for analysis and synthetic content generation, or for RLHF with the whole conversation as guidance. Having access to the following interactions can shed light on previous answers.
Even more, they can correlate chats across days, presumably humans try out LLM ideas in reality and return for iteration. That way LLMs indirectly get real world grounding.
They can't directly train on chat transcripts, because they contain private information and other things you don't want appearing in answers. I doubt they even look at them unless you press the thumbs down, in which case they probably use it in some indirect way.
They might try to look for trends or what questions are popular of course.
If they were training on other people's chat transcripts, the answers would read like how other people type, instead of telling you to delve mystically into intriguing questions.
Easier to write that than explain their internal processes. But what would be the point? Training on someone asking a question doesn't cause it to learn the correct answer.
> main reasons why they're offering ChatGPT for free and running ChatGPT Plus at a loss.
As opposed to what, though? It's not like there's such huge demand for these apps that they can charge money for them. They have no option but to give it away for free.
Yes -- said another way, if you're an ML researcher and you have human-provided (scraped) data, and an ability to generate synthetic data, then until recently, you had a controllable parameter: how much of your training data for your new model should be synthetic? You can vary this, run multiple experiments, and choose how much synthetic data to use -- and you can vary the specific configs about how that synthetic data is generated.
If synthetic data is mixed into your upstream data sources in a way you cannot control, then your ML team loses a valuable controllable parameter.
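As a sketch of what that controllable parameter looks like in practice (the function and document names here are hypothetical; real pipelines weight at the document or token level):

```python
import random

def build_training_set(human_docs, synthetic_docs, synthetic_fraction, size, seed=0):
    """Assemble a training set with a controlled synthetic fraction.
    Only meaningful if the two pools are cleanly separated upstream."""
    rng = random.Random(seed)
    n_syn = int(size * synthetic_fraction)
    return (rng.choices(synthetic_docs, k=n_syn)
            + rng.choices(human_docs, k=size - n_syn))

human = [f"human_doc_{i}" for i in range(1000)]
synthetic = [f"synthetic_doc_{i}" for i in range(1000)]

# sweep the ratio, train one model per setting, compare on held-out evals
for frac in (0.0, 0.1, 0.25, 0.5):
    train = build_training_set(human, synthetic, frac, size=200)
```

Once synthetic text leaks into `human_docs` undetected, the sweep no longer measures what you think it measures, which is the loss of control being described.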
You still have some of that control, but in a much more indirect way.
There are three kinds of data now: synthetic, pre-2022, and current. Everything pre-2022 is definitely written by humans, synthetic data is still synthetic, and post-2022 is a mix of both.
I wouldn't be surprised if "AI detectors" work somewhat for this use case. They're biased, far from accurate and a terrible idea if you need to make important decisions (like whether to expel a student for cheating), but there's quite a large room for errors here.
> Everything pre-2022 is definitely written by humans
I'm not sure if methods like article spinning count as written by humans. This is something you could automate before AI: it would take a human-written article and randomly swap words with similar meanings throughout to make it seem original.
Don’t forget machine-translated texts, where until ~2017 the translation was likely done by something much dumber / semantically lossy than an LLM, and after 2017 was basically done by an early form of LLM (the Transformers architecture originating in Google Translate.)
Many historical English-language news reports published on the English-language websites of foreign news media from non-English-speaking countries, from 1998 (Babelfish era) to ~a few months ago, may be unreliable training data for this reason.
They do work detecting LLM outputs that are sampled "naively" (when the model/user is really not trying to pass it as human output).
I copied a prompt translated from Spanish to English using ChatGPT Plus into a GPT-4o Azure OpenAI Service endpoint. It worked in Spanish but didn't run in English, because the default Azure OpenAI Service content filters detected a jailbreak attempt. It was quite weird.
Also drives your point home more efficiently. While it may be profane, there's far more speech available with far less "use" that is intentionally profane to spark a reaction without regard to what that reaction may be. Shock value for attention, rather than to carry home a point.
Yes, and big LLM developers have millions of humans in the loop. That's why they provide free access, for human in the loop filtering & guidance.
If I go to chatGPT and solve a coding task, maybe the first 3 ideas don't work and the 4th works. It can do RLHF setting the first 3 with negative and the fourth with positive score. They just used me to test their model and create a datapoint.
Using LLM is useful both ways - for humans, we get assistance, and LLMs get feedback for their outputs. This seems like the new form of "you are the product".
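That failed-then-accepted pattern maps naturally onto preference data. A hypothetical sketch (the field names and the accept/reject labeling heuristic are invented here, not any lab's actual pipeline):

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    prompt: str
    completion: str
    accepted: bool  # inferred from the follow-up, e.g. "still broken" vs "thanks!"

session = [
    Attempt("fix the parser", "patch v1", accepted=False),
    Attempt("fix the parser", "patch v2", accepted=False),
    Attempt("fix the parser", "patch v3", accepted=False),
    Attempt("fix the parser", "patch v4", accepted=True),
]

# pair the accepted answer against each rejected one: (chosen, rejected)
chosen = next(a for a in session if a.accepted)
pairs = [(chosen.completion, a.completion) for a in session if not a.accepted]
print(pairs)  # [('patch v4', 'patch v1'), ('patch v4', 'patch v2'), ('patch v4', 'patch v3')]
```

Each session like this yields several preference pairs at essentially zero labeling cost, which is the "you are the product" part.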
Then why not remove this crap (LLMs) from the loop altogether? How did we get from "AI will replace you" to "your new job will be an AIs janitor" in the space of about 12 months?
There is nothing wrong with being a janitor. You could also call it "AI editor" instead, if you want a job title that sounds more prestigious. Some people find it easier and more enjoyable to edit a first draft generated by a language model based on instructions than to write that first draft themselves.
I gotta say, Claude is a godsend for building out quick prototypes of ideas, especially when those ideas require domain specific knowledge that you know a little about but aren't specialized in. Which is most interesting programming projects.
Sure, I could do it myself, but it would take more time, each step would have less momentum, and I'd have to think more while I do it. Which, there's a place for that too, of course.
> Sure, I could do it myself, but it would take more time, each step would have less momentum, and I'd have to think more while I do it. Which, there's a place for that too, of course.
You just start faster, but end at the same time. If you really need to understand something there is no LLM shortcut. I spent hours interrogating Claude, in the same time I could have studied from a book and gotten even better grounding.
As I said, "there's a place for that too, of course."
I don't think Claude is a good choice if you're trying to prototype a project which uses tools that you don't understand conceptually. However, if you already have a pretty good understanding of the tools, and you're good at reading code, documenting desired functionality, and writing user-story requirements, then it's an amazing shortcut. Basically, if you are prepared to be the team lead or architect of a project, then Claude can function as a junior dev who:
* has a pretty good score on hackerrank
* happens to have the exact right domain specific knowledge for the project you want to build
* still gets disoriented by medium and large sized codebases, as many juniors are wont to do (you will need to take over as the main developer, or involve an intermediate or senior developer once the project grows to that size)
As an example, the other day I wanted to prototype a project using typescript, react-konva, and tone.js. I already have a strong understanding of typescript, react, HTML canvas, and FM synthesis. What I don't have is an encyclopedic knowledge of the APIs these specific tools expose, nor do I have code sitting in front of me which effectively combines them.
If I document the functionality I want well, Claude is really good at taking that documentation and building out either that prototype or the foundation for that prototype.
Another thing that I find that helps is to add an intermediate step. Describe the functionality you want the prototype to achieve, and then ask Claude to write a project proposal which documents this functionality and breaks the procedure for producing that functionality into actionable steps. You can then save the artifact it generates to the project files, and have it iterate through that. You'll eventually veer off course as the functionality you want shifts, or the order and granularity of tasks diverges from the plan which was originally designed, but it acts as a way to start a project with a much stronger foundation than just saying "I want a thing that does X. Now make it do Y too. Now make it do Z as well. etc..."
Another way to use Claude effectively, which I also utilized for the project I'm talking about, is to use Claude for throwaway prototyping. Rather than having Claude build out a single prototype and then taking the reins from there, have it build out one prototype, then scrap that one and have it build another from scratch, then scrap that and have it build a third from scratch.
Each iteration you'll learn a little more about how the functionality and structure you specified actually operates, and what Claude struggles with in relation to your project. This allows the next prototype to be built out with a little more of the functionality you want, and a little bit of a cleaner architecture.
Throwaway prototyping like that is probably the best way to do development (imo), because it increases the likelihood that your final product has a strong foundation, and smooths out the development process dramatically. You don't carry the baggage of the learning process into the final product or the next prototype. However, this traditionally creates an enormous upfront cost, as we end up having to build out the same functionality many times, just to have it once in the end product. But with Claude, I can accomplish the same number of from-scratch iterations in 1 day as it would take me to build out myself in 2 weeks, making this a suitable approach for any project that has a limited enough scope to use Claude for prototyping. That is to say, you're not going to prototype an Unreal Engine competitor using Claude, but prototypes for a browser based FM synth toy are well within its wheelhouse.
No, bad/wrong/nonsense is not the only risk here. You're missing the main point that the authors are making: the shape of the distribution gets changed by this process. A model trained on human data will produce fewer high-perplexity examples than it was trained on (you can see this in Fig 1b, even between generation 0 and 1). In a literal information theory sense, these perplexity values indicate how much information is in each example. Over successive generations models have less actual information to learn from even if they have the same volume of text.
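For readers unfamiliar with the term: perplexity is just the exponentiated average negative log-probability the model assigned to the tokens it saw, so uniformly predictable text carries less information per token. A minimal sketch:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

surprising  = [0.5, 0.25, 0.25]  # model was less sure: more information per token
predictable = [0.9, 0.9, 0.9]    # model saw roughly what it expected

print(perplexity(surprising))    # ~3.175
print(perplexity(predictable))   # ~1.111
```

The figure cited above is saying that the first kind of example gets rarer with every generation of training on model output.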
LLMs are milking us of knowledge and skills, repackage them and give it back to us. Models interact with the internet, humans and code execution. They are exploring. Lots of exploring now happens in the chat room, a place where ideas are first tried out. With billions of users, the volume of information LLMs collect from us is huge. We bring references, guidance and feedback right into its mouth, the LLM doesn't even need to do anything like crawling.
Imagine how many things we know, things we accumulated in our life experience, that were never written down anywhere. That information was lost to others. But now we use LLM assistants, so they get to be in the loop and collect tidbits of human life experience that is not written on the internet. And soon they will also work on audio/video and travel with us everywhere, seeing what we show them.
I think that maybe we are too harsh in expecting LLMs to be perfect. If they are based off of human input that is incorrect then we might propagate such errors. But they will still be quicker and much more reliable than most people. Isn’t this good enough? After all, we are willing to accept flaws in people, even including the president.
I suspect that the way forward will be to progressively clean the LLM input data as each error gets identified.
Keep in mind that the Prover-Verifier game is not that it's training on AI-generated data (as if to imitate it) -- rather, it's training against a discriminator that verifies for correctness (a calculator) and understandability (a smaller, less-capable language model). You can think of this as a distillation method, but it's not like it's generating large amounts of source data and then retraining on it. This method only works on specific problems where there is an absolute right answer that can be verified with an independent heuristic (in this case, a math calculation).
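The "verify with an independent heuristic" step can be as literal as this toy (the generator outputs are invented, and `eval` stands in for a real calculator; it is unsafe outside a toy):

```python
def calculator_check(expression: str, claimed: int) -> bool:
    """Independent verifier: recompute the arithmetic rather than
    trusting the model's claimed answer."""
    return eval(expression, {"__builtins__": {}}) == claimed  # toy only

# imagined generator outputs: some right, some confidently wrong
candidates = [
    ("12 * 12", 144),
    ("12 * 12", 124),  # plausible-looking mistake
    ("7 + 8", 15),
    ("7 + 8", 78),     # digit-concatenation error
]

# only verified samples are allowed back into the training pool
training_pool = [(expr, ans) for expr, ans in candidates if calculator_check(expr, ans)]
print(training_pool)  # [('12 * 12', 144), ('7 + 8', 15)]
```

The whole approach hinges on the verifier being independent of the generator; an LLM grading its own output provides no such guarantee.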
However, there is a lot of potential in the world of self-play and adversarial-training to improve the quality of our LLMs with true reinforcement learning.
For one recent paper on this topic, also check out SPAG -- I found this one to be fascinating:
I think that self-play and reinforcement learning are going to absolutely be important for the next level of LLM development. If you use AI-generated data, then you must have an objective metric to verify "goodness". Nothing is free, and simply asking an LLM to rate the quality of its own data is not going to cut it. I think that's the point of the article.
If you think about evolution and hill climbing, of course it works.
You have a pool of information and you accumulate new rearrangements of that information. Fitness selects for the best features within the new pool of data (For primates, opposable thumbs. For AI art, hands that aren't deformed.) It will naturally drift to better optima.
RLHF, synthetic data, and enrichment are all we need.
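The hill-climbing loop being invoked here is easy to write down (a toy sketch: mutate, score with a fitness function, keep improvements):

```python
import random

def fitness(x):
    # toy objective with a single peak at x = 3
    return -(x - 3.0) ** 2

random.seed(0)
x = 0.0
for _ in range(1000):
    candidate = x + random.gauss(0, 0.1)  # rearrangement / mutation
    if fitness(candidate) > fitness(x):   # selection keeps the better variant
        x = candidate

print(round(x, 2))  # drifts close to the optimum at 3
```

Of course, the entire scheme hinges on `fitness` actually measuring something real about quality.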
This misunderstands fitness. It's not a sure bet that what is most optimal is what you see; "good enough" given the environmental context is what you see. Just like with certain crystal structures in chemistry, you may only be in a localized threshold of fitness stability that is not necessarily optimal, but is separated from another, more optimal configuration by suboptimal intermediary steps that need more activation energy to overcome before falling into a state with lower entropy (or more optimal fitness).
In other words you can never be sure if synthetic data is any good or if what things gravitate toward are really most optimal.
> It's not a sure bet that what is most optimal is what you see.
I wouldn't ever make "most optimal" a criteria. We're looking for measurable improvements, not a jump to god emperor or apex predator.
> you may only be in a localized threshold of fitness stability that is not necessarily optimal, but separated from another optimal configuration by having suboptimal intermediary steps that need more activation energy to overcome before falling into a state with lower entropy (or more optimal fitness).
Optimization is like that. But unlike genetics, where we can't re-route the recurrent laryngeal nerve or change fundamental biochemistry, these are engineered systems where we can set up wildly different experiments at any time. Just to cite one of many different research threads, there's research now going into developing models from small-scale training data.
> you can never be sure if synthetic data is any good or if what things gravitate toward are really most optimal.
We can know if the synthetic data is better. We have objective measures, a scientific process, and we'll always be striving for improvement.
Only if you have a valid fitness metric. If you have humans looking at hands, then that's a good metric, as long as you really do have a human in the loop. Any automated metric (eg something that can evaluate hands) is great for measuring that specific dimension of fitness (after all, it was developed by a human, so it's really just an indirect way of feeding the human's evaluation into the machine). But it's useless for any other dimension. It'll happily rate the perfect hand coming out of a dogchickenpeach above the deformed hand petting the perfectly formed dog.
It's the same as any other kind of signal processing. You can increase the noise, but you can't get more signal than you started with.
Here, if the LLM decides that "monkey" is most often followed by "butt" and occasionally by "trainer", then it'll generate synthetic data with those frequencies and training on that data will not change its probability estimates at all. It will, however, drown out the signal that "you are a monkey butt" is more likely than "phlegm cigar monkey butt", if you'll forgive me the liberty of using those phrases to represent statistical correlations just beyond the frontier of what the LLM has learned. The synthetic data will teach it that everything it doesn't already know is equally probable, which will overwhelm human source data in which it isn't.
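That fixed-point behavior is easy to demonstrate: sample from a model's own distribution and re-estimate it, and up to sampling noise you get the same distribution back, with nothing new learned. A toy sketch using the phrases above:

```python
import random
from collections import Counter

random.seed(0)

# the model's learned next-word distribution after "monkey"
learned = {"butt": 0.8, "trainer": 0.2}

# generate a large synthetic corpus from the model itself
words = list(learned)
weights = [learned[w] for w in words]
synthetic = random.choices(words, weights=weights, k=100_000)

# "retrain" by re-estimating frequencies from the synthetic corpus
counts = Counter(synthetic)
reestimated = {w: counts[w] / len(synthetic) for w in words}

# the estimates barely move: no information added, only sampling noise
print(reestimated)
```

Mix this synthetic corpus into a larger training set and the noise it contributes actively dilutes whatever genuine signal the human data carried.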
Synthetic data has to work if we hope to have ML models that can improve themselves in a similar fashion as humans when it comes to advancing knowledge.
They mathematically cannot unless they have access to a way of measuring fitness. One that goes beyond an evaluation based on what they have already learned.
Data created automatically is not the same as human-curated data, though both are synthetic. Auto-created data often suffers from a host of demerits (duplication, bias, error, unnatural distribution, irrelevance to the intended domain, etc.). Human-curated data usually avoids these pitfalls, and thus is far more valuable when training; otherwise all human teachers would be equally good. So auto- vs. curated data are incomparable when training naive neophytes like ML models, or children.
> If you think about evolution and hill climbing, of course it works.
You don't even need to go that far. How do most children learn? By reading textbooks and listening to lesson plans assembled by their teachers from all the relevant content the teachers have experienced.
Our education systems are built on synthetic data that is created for optimized learning, so that every child doesn't have to re-derive the universe from scratch to learn some basic maths.
Well, that's what TFA is about. If you indiscriminately ingest synthetic data into training (the child learning from a textbook they wrote themselves), the model collapses.
The SOTA is to use a discriminator (often another LLM or ML algo) to select the best output before feeding it into the training data. That’s what OpenAI, Anthropic, et al have been doing. One of them just published a paper about it a few weeks ago.
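The labs' actual pipelines aren't public, but the filter-before-training idea can be sketched in a few lines (everything here is hypothetical: a toy heuristic stands in for the discriminator model or judge LLM):

```python
def filter_generations(candidates, score_fn, keep_fraction=0.25):
    """Keep only the top-scoring fraction of model generations
    before they are added to the training set."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Toy discriminator: prefers longer, punctuation-terminated answers.
# A real pipeline would call a reward model or a judge LLM here.
def toy_score(text):
    return len(text.split()) + (1 if text.rstrip().endswith(".") else -1)

candidates = [
    "The mitochondria is the powerhouse of the cell.",
    "idk lol",
    "Cells produce most of their ATP in the mitochondria.",
    "answer",
]
kept = filter_generations(candidates, toy_score, keep_fraction=0.5)
print(kept)  # the two full sentences survive the filter
```

The point is that the selection step injects information the generator alone doesn't have, which is what keeps the loop from degenerating.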
It’s an analogy. The learning materials teachers create for students are very much like synthetic data; they're just not assembled from algorithmic output.
Kids learn walking, talking, reading, arithmetic, and physics by doing things in the physical world. Adults may speak differently to kids than to other adults, but it's a stretch to say that's synthetic. Equivalent to synthetic would be a group of kids who just grew up together and made up a novel language.
Granted, synthetic is closely related to synthesis, but in common parlance synthetic means something that is not natural, or abiotic in some sense. In the case of synthetic data, it should imply data that doesn't occur naturally from human and natural sources; i.e., synthetic data would be exactly the data that is assembled from algorithmic output. Granted, I'm not able to explain it as well as I understand it.
Yeah and that’s why we call it “standing on the shoulders of giants.” Humans went through tons of trial and error in every facet of life to get where we are today. We kept the stuff that worked and taught it.
But before humans can understand enough language to ingest that synthetic data, they do a lot of their own discovery based training where they learn about the world physically and absorb the language people around them use, kind of like throwing random internet data at an LLM.
I think the direct action of a person taking their ideas and thoughts and going through them many times (making changes / updates / fixes) fits better than eating something. However, I do think you still need some form of validation data to ensure these are good changes.
That said, I do get the spirit of the article: as more of the information generated online comes from LLMs, the validity and usefulness of the output decreases.
Not sure why you’re downvoted; I think a comparison with prions seems apt and interesting, and bad protein copies that can replicate is essentially an information process. Adversarial-example research in recent years, showing how you can sabotage a working dog/cat classifier with a one-pixel change, feels similar to how the tiniest parts of large systems can sometimes undermine the whole completely, albeit with low probability. And finally, since models will bootstrap models that bootstrap models, inevitably there are already subtle issues out there in the wild that may have an incubation period of many years before the downstream effects are completely clear.
The problem is systemic. People believe that the pursuit of monetary and financial profits by corporations will lead to the creation of benevolent artificial intelligence. I personally think this is essentially a religion because it is obvious that the pursuit of profits can not actually create anything benevolent, let alone intelligence.
I think this paper is more focused on figuring out what would happen in the theoretical scenario that most data on the web in the future might be AI generated without being marked as such. As they say,
> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear.
The companies you listed are surely not training their models indiscriminately. In particular, they have piles of data that they can be highly confident were written by humans.
Here's Gretel's response to the Nature paper on model collapse. It covers the methodology, and how it's flawed, in detail, and highlights a lot of other great synthetic-data research.
I must be missing something. Training on the output of your system as if it were validated input seems like an obvious no-no. I'm not talking about using synthetic data (however that might be created in this situation), but rather using anything and everything found on the web as if it were "real", i.e. as if it were human-generated texts rather than the output of the LLM.
In this case of course there are multiple LLMs that are creating text which finds its way to the web, but to the extent that the output of the different LLMs have commonalities, this still seems problematic.
And afaik, there are no metrics or algorithms that reliably distinguish between human-generated and LLM-generated text, at least not for the current generations of LLMs.
> Training on the output of your system as if it were validated input seems like an obvious no-no.
Imagine a scientist inventing theories without testing anything, and then continuing to build on top. Crazy. Not even humans can create absent some kind of feedback or validation from outside. That's why we invented the scientific method.
Isn’t that how math works in some respects? In that, there’s only a hierarchy of consistency (no absolute consistency) for most of math. And we just keep building and building. We tried the absolute consistency route and found it too limiting.
Maybe that this doesn’t work for LLMs is a sign they aren’t on the path to AGI…
Personally I've found LLMs horrendous at this kind of stuff. I'm basically an RLHF peon by trade, and if I ever need a quick way to fool a model, I go to simple logical problems, where it can't lean on external structures, only itself. I don't mean logical syntax but logical reasoning. I can't share recent stuff, but just a few months ago the models I work with failed to reason that removing 12 cards from a regular deck couldn't remove an entire suit. That kind of stuff. Why would I want to make my prompt longer and more detailed to provide it extra structure (which is logically superfluous) just to ensure it gets the right answer? I'm sure a wordy prompt could get it there. I'm interested in its ability to "reason", not in prompt engineering.
Given that math is devoid of external structure, I wonder if there's something to this (it's at least interesting to speculate).
I think you're right. When I was experimenting with llama 1, I was able to easily observe that with a short prompt and a long response, the response _rapidly_ degraded the longer it went, because it was seeing and amplifying the patterns in its context window so far.
It is intuitively obvious that these problems would get even worse if the garbage output found its way into the training set, and not just into the context window.
It's _relatively_ easy, I think, to filter out sites with a large proportion of low-quality AI-generated glurge.
Then you're left with a lot of AI generated or assisted content that has quite often been filtered and modified by humans, so that might mitigate some of the problems that cause model collapse because the filtered content _should_ better reflect reality or desirable output?
I mean a fair bit of content on Reddit and Twitter is machine generated now, right? And content on Reddit and Twitter is being used to train new models, right?
Training on AI-generated data isn't a problem, and has been routinely done by everyone for 18+ months.
The issue is training on 'indiscriminate' AI-generated data. That just leads to more and more degenerate results. No one is doing this, however; there is always some kind of filtering to select which generated data to use for training.
So the findings of that paper are entirely unsurprising and, frankly, intuitive and already well known.
Good venues include, for example, main-track NeurIPS, ICML, and ACL.
Nature is notorious for publishing PR pieces that don't reproduce, and their ML theory publishing has been quite poor. They do pretty well on things like AlphaGo, materials science, or weather modeling because it's more in their wheelhouse and the results don't require a deep understanding of info theory or ML practice.
Those venues have huge issues with referees. It comes down to who is reviewing the work.
The irony in your comment is that it is related to the paper we are discussing. There is a big problem with poisoning from group-think and self reinforcement in current ML research.
I call this "LLM inbreeding." It's a vicious loop where new models are trained on AI-generated content, resulting in the quality degenerating with each generation.
The article contains no proof of theorem 3.1 and finding counterexamples seems trivial. Adult male weight can be modeled by N(85, 20). You can recursively "train" the model on data it generates without having it collapse. It will stay stationary as long as the samples are large enough.
I believe that counterexample only works in the limit where the sample size goes to infinity. Every finite sample will have μ≠0 almost surely. (Of course μ will still tend to be very close to 0 for large samples, but still slightly off.)
So this means the sequence of μₙ will perform a kind of random walk that can stray arbitrarily far from 0 and is almost sure to eventually do so.
I agree. The authors generate a dataset of a similar size to the original and then train on that repeatedly (e.g. for multiple epochs). That's not what you need to do in order to get a new model trained on the knowledge of the teacher. You need to ask the teacher to generate new samples every time; otherwise your generated dataset is not very representative of the totality of the teacher's knowledge. Generating fresh samples every time would (in the infinite limit) solve the collapse problem.
Agreed, that's what I struggle to see as well. It's not really clear why the variance couldn't stay the same or go to infinity instead. Perhaps it does follow from some property of the underlying Gamma/Wishart distributions.
Maybe. "Overall, this only shows us how far on average we go from the original distribution, but the process can only ’terminate’ if the estimated variance at a certain generation becomes small enough, i.e. we effectively turn into a delta function." IIUC, variance is modeled as a random walk that will sooner or later reach zero. I'm not sure I buy that, because the variance "walks" orders of magnitude slower than the mean and is much more robust for large sample sizes.
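For what it's worth, the variance walk being debated here is easy to simulate with the standard library (my own sketch with small per-generation samples, not the paper's experiment): refit a Gaussian to finite samples drawn from the previous fit, over and over.

```python
import random
import statistics

def next_generation(mu, sigma, n, rng):
    """Draw n samples from the current fit, then refit by MLE."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(xs), statistics.pstdev(xs)

rng = random.Random(42)
mu, sigma = 85.0, 20.0           # the adult-male-weight model, N(85, 20)
sigmas = [sigma]
for _ in range(1000):
    mu, sigma = next_generation(mu, sigma, n=50, rng=rng)
    sigmas.append(sigma)

print(sigmas[0], sigmas[-1])     # the fitted sigma collapses toward zero
```

With larger samples per generation the downward drift in log σ² shrinks roughly like 1/n, which is consistent with the observation that the variance walks far more slowly for large sample sizes; with small samples, though, the collapse is dramatic.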
3. You still won't get beyond the imitation-game boundary without exploration & feedback, i.e. the recursive-improvement doomers are, as of now, still wrong.
> 1. this is nothing that should surprise anyone who has an intuition on control theory and the evolution of unconstrained markov chains
You don't even need to know what a Markov chain is. It is intuitively obvious to anyone with two brain cells to rub together that AI can't improve by eating its own vomit.
"Given that training a single moderately large model produces twice the American lifetime’s worth of CO2 (ref. 15), we opted to not run such an experiment and instead focus on a more realistic setting for a proof of concept."
There are other ways AI can help train other AI that aren't generating data. AI could remove low quality data from a training set. It could assist humans in structuring video, 3D and physics simulation datasets for the best learning results.
So they fine-tuned an existing model using its own completions to produce the training set for the next run, which uses the fine-tuned model as the base. They mention catastrophic forgetting, so they are aware of it. I suppose they wanted to get results as quickly as possible, but this isn't an accurate model of reality (pun not intended). They've only succeeded in demonstrating something that is well known. If they had made the effort to simulate mitigation of bad data, and a growing corpus that included proportionally more synthetic data over time, it would have been interesting.
I thought this was fairly obvious. Imperfections would only compound over time. Does anyone remember recursively inter-translating between two languages?
If I'm correct, we generally perceive AI-generated data to be indistinguishable from human-sourced data, and we don't have a tool to reliably assess whether a text is AI-generated.
However, could it be that texts generated by AI models posses some kind of statistical property which causes training to collapse?
Then, would it allow us to use it to detect AI texts?
Maybe this is the true test of intelligence, instead of "emulating intelligence"?
I can learn from Pythagoras' work, extend it, combine it, apply it, and produce works that are more valuable than the original. Perhaps that gets recognized as important, and others then take it, learn, and repeat the process, adding their own experience, increasing the general intelligence.
This is about language models. They include plenty of real-world concepts that are essential to language. But they are not models of intelligence or knowledge or reasoning.
Using generated training data is a good way to ensure that the training includes things that are too obvious to appear in normal writing. (Such as "there are zero giraffes in this photo.") This paper describes the limits of using transformer-generated data to train other transformers.
Conceptually, an LLM is a lossy compression of all of the data it saw during training. If you feed it lossy data, at each iteration you will get a poorer and poorer signal and more noise.
Prior generations learned this by copying VHS tapes over and over and making photocopies of photocopies. You can see it today by opening and saving a JPG over and over again.
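The photocopy analogy can be reproduced numerically (a toy sketch, nothing more): apply a slightly blurring "re-encode" over and over and measure how far each generation drifts from the original.

```python
import math

def lossy_copy(signal):
    """One 'photocopy': a slightly blurring re-encoding of the signal."""
    out = []
    for i in range(len(signal)):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, len(signal) - 1)]
        out.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return out

# Original "document": a low-frequency and a high-frequency component.
original = [math.sin(0.3 * i) + 0.5 * math.sin(2.5 * i) for i in range(200)]

copy, errors = original, []
for _ in range(20):
    copy = lossy_copy(copy)
    rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, copy)) / len(copy))
    errors.append(rms)

print(errors[0], errors[-1])  # each generation drifts further from the source
```

The fine detail (high-frequency component) is the first thing to go, which is the VHS/JPEG effect in miniature: each pass preserves the broad strokes while permanently discarding a little more of the signal.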
If they're considering Reddit content to be free of generated material, I've got bad news for them. It's not quite the Chernobyl-grade hole that Pinterest has become, but it's hardly "low background".
I still believe reddit is an amazing source. Any article you read on reddit, chances are the comments are better than the original text. They will debunk the article, present a diversity of reactions, and, most importantly, they will be grounded in public opinion, unlike the press, which caters to moneyed interests.
You just copy-paste a conversation into the LLM and ask for an article. For taste, here is one generated from this very conversation. https://pastebin.com/raw/JFH6PGqg
> Any article you read on reddit, chances are the comments are better than the original text.
We're talking about reddit dot com here? Seriously? I find it difficult to find any comments worth reading at all on that website. 99% of the stuff that isn't buried is just the same recycled jokes again and again and again.
Given a time snapshot and enough computing power, isn't recursion inevitable? It's like running out of known universe given time x. So then we're back creating data without a prior dataset, which is still a human domain.
It’s a lossy transformation, so you’re losing information each time. It’s never going to add information.
However, some information is junk that obscures the good stuff. It’s likely that how they train today is very inefficient compared to what’s possible, and there will be smarter ways to transform preexisting data so that it’s a better dataset to train on, without losing very much.
> there will be smarter ways to transform preexisting data so that it’s a better dataset to train on, without losing very much
Like, take for example search. Instead of training on a bunch of scraped texts, you take one prompt, select 10 references, and use it to synthesize an answer. Referencing multiple texts gives you more than training on them directly. The LLM could catch contradictions, observe the distribution of human opinions, note if the topic is controversial. And then output a wikipedia-like article. Do this billions of times, and you got a refined dataset. You can iterate on top, using the articles as source and writing meta articles. Or just silly studies like writing a paper about "Characters named Charlie in literature". You can slice and dice the data in any way, and analyze the cross section.
Let's restate. I'd imagine you end up in local minima that are difficult to escape using model generated data. So sure, non-zero gradients, but if you plot the gradients, I would expect them to orbit at that point. But it seems like they diverge.
Mini-batches and dropout mean that you are constantly jumping out of and into other minima during training of any type (highly-redundant solution space is an important feature of deep learning). This is deliberate and necessary to explore the gigantic parameter space of these huge LLM models.
If the model collapse means that the text produced by it is not statistically identical to the garbage that fills the Internet - then I guess a collapse is the goal.
Or navel-gazing. In fact, that's one of the classically known flaws. (So well known that it has many names: ivory tower, navel gazing, getting stuck in your own head...)
If you don't compare your thoughts to the outside world, it's easy for them to diverge more and more from reality.
It's important to note that outside world means the actual world, not the thoughts of other humans. You need a way to establish ground truth, which comes from observing the actual outcome of actions and experiments.
As far as I understand Douglas Hofstadter's Gödel, Escher, Bach, self-referential recursive structures (strange loops) are the foundation of consciousness (among other interesting things). I've been watching to see whether LLMs becoming self-referential actually improves them, as opposed to degrading them.
Back when I was getting my econ degree, we were taught about the Ultimatum game, which goes like this: You get two participants who don't know each other and will (ostensibly) never see each other again. You give one of them $100, and they make an offer of some portion of it to the other. If the other accepts, both parties keep their portion - so, if A offers B $20, and B accepts, A keeps $80 and B keeps $20, if B rejects, both parties get nothing. Standard economic theory suggests A can offer $1 and B will accept, because otherwise B gets nothing. Spoiler for those of you who haven't seen how standard economic theory plays out in real life, that's not how the game went - typically, offers below ~$30 or so got rejected, because B was a real feeling person who felt like they were getting screwed and opted to punish A for doing so. The exception to this - the people who would take the $1 offer - were people who had been taught economic theory. It turns out you _could_ screw them over and they'd pat themselves on the backs for being very wise.
The "tragedy of the commons" is another one of those parts of standard economic theory that never actually played out in reality - we've got examples from all over the world of communities implementing practices and often entire belief systems that led them to be responsible stewards of shared resources without requiring unilateral ownership of that resource and singular acquisition of the benefits of that stewardship, and yet first on the lips of every modern capitalist when describing why they're at a disadvantage if they're not the ones polluting the water supply is the tragedy of the commons.
Rebecca Solnit wrote a book, "A Paradise Built in Hell", on how people behave during disasters, and found broadly the same thing - contra the prepper myths, most people most of the time faced with disaster come together to work cooperatively to help each other.
We're a fundamentally social species - we've got smaller brains than Neanderthals did, we're not a particularly tough species, but we're very, very good at cooperating with each other.
Each player can limit the other's income to $0 - the offerer can offer $0 and the receiver can reject any deal.
So then what's optimal? $50 seems obviously fair, but does that mean we ought to reject offers of $49 100% of the time? Not quite, to limit the opponent's expected income for an offer of $49 to $50 instead of the $51 they left for themselves, we can use a mixed strategy that only accepts the offer with probability 50/51. Extending that gives the opponent a benefit curve that is linear as they leave themselves more money up to $50 and then flat at $50 afterwards.
That's good, but we can make it better - if we accept offers for $X<$50 with probability 50/(100-X) - epsilon*(50-X), then their expected benefit curve is smooth and has a peak at $50, which is the most we can expect to make except against a generous opponent.
After all that, playing this game as stated against an unknown opponent there's a lot of uncertainty. Maybe all your opponents are entirely irrational and move at random. Maybe all your opponents have colluded and decided that $66 for the offerer and $34 for the receiver is fair and that's the only deal they'll make. But if you think that random actors in the universe are reasonably intelligent and can discover the equilibrium above with the thought worth putting into this Ultimatum game, the receiver strategy above properly aligns incentives.
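Plugging the construction above into code (my sketch of the comment's receiver strategy, with ε = 0.001) confirms that the offerer's expected payoff rises smoothly and peaks exactly at the even split:

```python
EPS = 0.001

def accept_prob(x):
    """Receiver accepts an offer of $x with this probability."""
    if x >= 50:
        return 1.0
    # The mixed strategy from the comment: 50/(100-x) minus a small
    # penalty that grows the stingier the offer gets.
    return 50 / (100 - x) - EPS * (50 - x)

def offerer_expected(x):
    """Expected income for an offerer who keeps $(100 - x)."""
    return (100 - x) * accept_prob(x)

payoffs = [offerer_expected(x) for x in range(101)]
best_offer = max(range(101), key=lambda x: payoffs[x])
print(best_offer, payoffs[best_offer])  # 50 50.0
```

Without the ε term the payoff curve is perfectly flat at $50 for every offer below $50; the ε tilt is what turns that plateau into a strict peak at the fair split, properly aligning the offerer's incentives.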
> It turns out you _could_ screw them over and they'd
End up with a dollar in their pocket which they otherwise wouldn't have.
The Ultimatum game is a useful insight into human psychology: for one thing, it tells us who thinks that the defector in this equilibrium is better off than a counterfactual cooperator.
Ah, but they have their pride! Ok. My pride is not affected by someone else having 99 bucks they didn't earn, and myself $1 likewise. Maybe that other fellow really needed the money.
I don't know what the hell you're talking about. Your argument is incoherent. If you wanted to allocate the money according to the individual's utility of money, then a rule of thumb of $1 is going to be wrong. You should, given no information, assume that both have the same utility of money and that the utility of money is diminishing, favouring an even split.
You may be interested in some of the foundational papers exploring game theory models similar to the Ultimatum game[1][2]. These are known as Iterated Prisoner's Dilemmas.
It's crazy how most political or economic systems would very obviously collapse in the real world almost instantly without some kind of voluntary moral contract (explicit or implied), yet we've got huge clumps of people demonizing one system or another based on the context of what happens when you implement it in a morally dead societal context.
Like there are a ton of people who smirk at your last paragraph and go "nuh uh, hashtag late stage capitalism"
A hundred percent. I've said this elsewhere, but a primary problem for at least American society at this point is we don't have a commonly-agreed upon moral system other than the market - things like Martin Shkreli buying drugs people need to live and jacking the price up are Bad, but we don't have a common language for describing why it's immoral, whereas our only real common shared language, the market, is basically fine with it as long as it's legal. A lot of the market logic works fine for society within constraints - optimize your costs, but not at the expense of your workers; increase your prices if you can, but don't be a ghoul about it; lobby for your position, but don't just buy a supreme court judge.
If you iterate the game, it’s obvious. I, as the responder, control the proposer’s income. Extend to infinity with knowledge of iteration and you reach symmetry between proposer and responder.
We're shockingly bad at doing this in modern society. Our temporal planning horizon is somewhere between 6 months and 5 years, whereas our lifespans are around 75-80.
...in the real world, A tells B that he "sourced" the deal and therefore deserves a bigger cut and in the real world, B agrees up to a point (the $30 mark). Over time and rounds of playing the game, the A's of the world learn where the line is and optimize to stay on the correct side of it, only testing the other side 1-2% of the time to see if rules/behavior has changed.
This seems extremely interesting, but I don't have the time right now to read this in depth (given I would also need to teach myself a bunch of technical concepts too).
Anyone willing to weigh in with a theoretical intuition? The one in the paper is just a little inaccessible to me right now.