A few news outlets, including Gizmodo and Ars Technica, have written about the study.
Most of what I quote from here on out is from Ars Technica.
The results showed that adding more “junk data” to the training sets had a statistically significant effect on the reasoning and long-context benchmarks across models. The effects were more mixed on the other benchmarks, though. For example, a 50/50 mix of “junk” and control data used for the Llama 8B model generated better scores on some benchmarks (ethical norms, high Openness, low Neuroticism, and Machiavellianism) than either “fully junk” or “fully control” training data sets.
Based on these results, the researchers warn that “heavily relying on Internet data leads LLM pre-training to the trap of content contamination.” They go on to “call for a re-examination of current data collection from the Internet and continual pre-training practices” and warn that “careful curation and quality control will be essential to prevent cumulative harms” in future models.
Gizmodo takes it a bit further:
And wouldn’t you know it, it turns out that consuming directly from the internet landfill that is X isn’t great for thinking clearly. All four models tested—Llama3 8B, Qwen2.5 7B/0.5B, Qwen3 4B—showed some forms of cognitive decline. Meta’s Llama proved the most sensitive to the junk, seeing drops in its reasoning capabilities, understanding of context, and adherence to safety standards. Interestingly, a much smaller model, Qwen 3 4B, proved more resilient, though still suffered declines. It also found that the higher the rates of bad data, the more likely a model was to slip into “no thinking” mode, failing to provide any reasoning for its answer, which was more likely to be inaccurate.
This, of course, tracks. Humans who consume a lot of "junk" will also exhibit these traits. It all goes back to the old computing adage of "Garbage In, Garbage Out."
Harry Mudd, of course, knew this all along, once Spock showed him the way:
As usual, my friend Curtis Clark nails it:
