I have mixed feelings about this class-action lawsuit against OpenAI and Microsoft, claiming that it “scraped 300 billion words from the internet” without either registering as a data broker or obtaining consent. On the one hand, I want this to be a protected fair use of public data. On the other hand, I want us all to be compensated for our uniquely human ability to generate language.
There’s an interesting wrinkle on this. A recent paper showed that using AI generated text to train another AI invariably “causes irreversible defects.” From a summary:
The tails of the original content distribution disappear. Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions. We call this effect model collapse.
Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data.
This is the same idea that Ted Chiang wrote about: that ChatGPT is a “blurry JPEG of all the text on the Web.” But the paper includes the math that proves the claim.
What this means is that text from before last year—text that is known human-generated—will become increasingly valuable.
More Stories
Clever Social Engineering Attack Using Captchas
This is really interesting. It’s a phishing attack targeting GitHub users, tricking them to solve a fake Captcha that actually...
US Cyberspace Solarium Commission Outlines Ten New Cyber Policy Priorities
In its fourth annual report, the US Cyberspace Solarium Commission highlighted the need to focus on securing critical infrastructure and...
Cybersecurity Skills Gap Leaves Cloud Environments Vulnerable
A new report by Check Point Software highlights a significant increase in cloud security incidents, largely due to a lack...
Going for Gold: HSBC Approves Quantum-Safe Technology for Tokenized Bullions
The bank giant and Quantinuum trialed the first application of quantum-secure technology for buying and selling tokenized physical gold Read...
This Windows PowerShell Phish Has Scary Potential
Many GitHub users this week received a novel phishing email warning of critical security holes in their code. Those who...
Infostealers Cause Surge in Ransomware Attacks, Just One in Three Recover Data
Infostealer malware and digital identity exposure behind rise in ransomware, researchers find Read More