I have mixed feelings about this class-action lawsuit against OpenAI and Microsoft, claiming that it “scraped 300 billion words from the internet” without either registering as a data broker or obtaining consent. On the one hand, I want this to be a protected fair use of public data. On the other hand, I want us all to be compensated for our uniquely human ability to generate language.
There’s an interesting wrinkle on this. A recent paper showed that using AI generated text to train another AI invariably “causes irreversible defects.” From a summary:
The tails of the original content distribution disappear. Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions. We call this effect model collapse.
Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data.
This is the same idea that Ted Chiang wrote about: that ChatGPT is a “blurry JPEG of all the text on the Web.” But the paper includes the math that proves the claim.
What this means is that text from before last year—text that is known human-generated—will become increasingly valuable.
More Stories
The AI Fix #30: ChatGPT reveals the devastating truth about Santa (Merry Christmas!)
In episode 30 of The AI Fix, AIs are caught lying to avoid being turned off, Apple’s AI flubs a...
US and Japan Blame North Korea for $308m Crypto Heist
A joint US-Japan alert attributed North Korean hackers with a May 2024 crypto heist worth $308m from Japan-based company DMM...
Spyware Maker NSO Group Found Liable for Hacking WhatsApp
A judge has found that NSO Group, maker of the Pegasus spyware, has violated the US Computer Fraud and Abuse...
Spyware Maker NSO Group Liable for WhatsApp User Hacks
A US judge has ruled in favor of WhatsApp in a long-running case against commercial spyware-maker NSO Group Read More
Major Biometric Data Farming Operation Uncovered
Researchers at iProov have discovered a dark web group compiling identity documents and biometric data to bypass KYC checks Read...
Ransomware Attack Exposes Data of 5.6 Million Ascension Patients
US healthcare giant Ascension revealed that 5.6 million individuals have had their personal, medical and financial information breached in a...