US publishers tell Common Crawl to stop scraping and delete archive
thm
29 points
10 comments
June 09, 2026
Related Discussions
- Government backtracks on AI and copyright after outcry from major artists · chrisjj · 28 pts · March 18, 2026 · 52% similar
- News outlets are limiting the Internet Archive’s access to their journalism · jaredwiener · 254 pts · May 21, 2026 · 52% similar
- Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record · pabs3 · 507 pts · March 21, 2026 · 50% similar
- Media scraper Gallery-dl is moving to Codeberg after receiving a DMCA notice · MoltenMonster · 94 pts · April 06, 2026 · 50% similar
- Aggressive AI scrapers are making it kinda suck to run wikis · cookmeplox · 19 pts · March 13, 2026 · 50% similar
Discussion Highlights (5 comments)
toomuchtodo
Crawling will go underground à la Anna’s Archive.
khelavastr
This is shady. Copyrighters absolutely do not get to control use of their copyrighted material when people mentally, sonically, or physically reproduce it for personal use. It's absurd to say "you can't record this book to a friend or robot". Nobody seems to actually reproduce the copyrighted materials. High-dimensional eigendecompositions which underpin AI similarity are some of the most literally derivative materials of texts that you can imagine.
Stagnant
Hard to see any practical benefit in going after Common Crawl. The situation of freely accessible crawled data is bad enough as it is, with archive.org and CC being pretty much the only available sources. We need more initiatives like them, not fewer. The scary thing is how the anti-AI sentiment is being used to lock things down further.
Grimblewald
It's like they don't understand the problem Common Crawl solved rather neatly. You think the skid scrapers are bad? Wait till the competent players lose access to CC.
mindcrime
These guys (the publishers) are fighting last year's war. Nobody (to a first approximation) gives a shit about going to the NY Times website, or The Guardian website, or the BBC website, etc. to find information. They expect to use search engines and AI services to find stuff, and then maybe click through to the source site(s) for more details or whatever. The publishers need to rethink their entire take on how the Internet works or any "victory" they earn is going to be extremely Pyrrhic.