Long but serious & important: News Publishers Are Now Blocking The Internet Archive, And We May All Regret It --Techdirt
(bolded first lines and/or sentences for visual flow)
Much, much more detail at https://www.techdirt.com
... News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.
But blocking the Internet Archive isn't going to stop AI training. What it will do is ensure that significant chunks of our journalistic record and historical cultural context simply disappear.
And that's bad.
The Internet Archive is the most famous nonprofit digital library, and has been operating for nearly three decades. It isn't some fly-by-night operation looking to profit off publisher content. It's trying to preserve the historical record of the internet, which is way more fragile than most people comprehend. When websites disappear (and they disappear constantly), the Wayback Machine is often the only place that content still exists. Researchers, historians, journalists, and ordinary citizens rely on it to understand what actually happened, what was actually said, what the world actually looked like at a given moment.
In a digital era when few things end up printed on paper, the Internet Archive's efforts to permanently preserve our digital culture are essential infrastructure for anyone who cares about historical memory.
And now we're telling them they can't preserve the work of our most trusted publications.
Think about what this could mean in practice. Future historians trying to understand 2025 will have access to archived versions of random blogs, sketchy content farms, and conspiracy sites, but not The New York Times. Not The Guardian. Not the publications that we consider the most reliable record of what's happening in the world. We're creating a historical record that's systematically biased against quality journalism.
Yes, I'm sure some will argue that the NY Times and The Guardian will never go away. Tell that to the readers of the Rocky Mountain News, which published for 150 years before shutting down in 2009, or to the 2,100+ newspapers that have closed since 2004. Institutions, even big, prominent, established ones, don't necessarily last.
As one computer scientist quoted in the Nieman piece put it:
"Common Crawl and Internet Archive are widely considered to be the good guys and are used by the bad guys like OpenAI," said Michael Nelson, a computer scientist and professor at Old Dominion University. "In everyone's aversion to not be controlled by LLMs, I think the good guys are collateral damage."
That's exactly right. In our rush to punish AI companies, we're destroying public goods that serve everyone.
The most frustrating bit of all of this:
The Guardian admits they haven't actually documented AI companies scraping their content through the Wayback Machine. This is purely precautionary and theoretical. They're breaking historical preservation based on a hypothetical threat:
The Guardian hasn't documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it's taking these measures proactively and is working directly with the Internet Archive to implement the changes.
And, of course, as one of the good guys of the internet, the Internet Archive is willing to do exactly what these publishers want. They've always been good about removing content, or not scraping content, that people don't want in the archive. Sometimes to a fault. But you can never (legitimately) accuse them of malicious archiving (even if music labels and book publishers have).
Either way, we're sacrificing the historical record not because of proven harm, but because publishers are worried about what might happen. That's a hell of a tradeoff.
This isn't even new, of course. Last year, Reddit announced it would block the Internet Archive from archiving its forums (decades of human conversation and cultural history) because Reddit wanted to monetize that content through AI licensing deals. The reasoning was the same: can't let the Wayback Machine become a backdoor for AI companies to access content Reddit is now selling. But once you start going down that path, it leads to bad places.
The Nieman piece notes that, in the case of USA Today/Gannett, it appears that there was a company-wide decision to tell the Internet Archive to get lost:
In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.
Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh's original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: archive.org_bot and ia_archiver-web.archive.org. These bots were added to the robots.txt files of Gannett-owned publications in 2025.
Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, "Sorry. This URL has been excluded from the Wayback Machine."
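(A quick technical aside for anyone curious what that kind of blocking actually looks like under the hood. The robots.txt lines below are a hypothetical sketch built from the two bot names Nieman reports, not Gannett's actual file, and the short Python check simply shows how a rule-following crawler like the Archive's would read them using the standard-library robotparser.)

# Hypothetical sketch: robots.txt rules of the kind Nieman describes,
# plus a check showing how a compliant crawler would obey them.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver-web.archive.org
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The two Archive bots are refused everything; ordinary agents are unaffected.
for agent in ("archive.org_bot", "ia_archiver-web.archive.org", "Mozilla/5.0"):
    print(agent, parser.can_fetch(agent, "https://example.com/news/story.html"))

In other words, nothing is technically "protected" by these lines; they only work because the Internet Archive is polite enough to honor them, which is exactly the point the article goes on to make.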
A Gannett spokesperson told NiemanLab that it was about "safeguarding our intellectual property," but that's nonsense. The whole point of libraries and archives is to preserve such content, and they've always preserved materials that were protected by copyright law. The claim that they have to be blocked to safeguard such content is both technologically and historically illiterate.
And here's the extra irony: blocking these crawlers may not even serve publishers' long-term interests. As I noted in my earlier piece, as more search becomes AI-mediated (whether you like it or not), being absent from training datasets increasingly means being absent from results. It's a bit crazy to think about how much effort publishers put into search engine optimization over the years, only to now block the crawlers that feed the systems a growing number of people are using for search. Publishers blocking archival crawlers aren't just sacrificing the historical record; they may be making themselves invisible in the systems that increasingly determine how people discover content in the first place.
The Internet Archive's founder, Brewster Kahle, has been trying to sound the alarm:
If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.
But that warning doesn't seem to be getting through. The panic about AI has become so intense that people are willing to sacrifice core internet infrastructure to address it.
What makes this particularly frustrating is that the internet's openness was never supposed to have asterisks. The fundamental promise wasn't "publish something and it's accessible to all, except for technologies we decide we don't like." It was just open. You put something on the public web, people can access it. That simplicity is what made the web transformative.
Now we're carving out exceptions based on who might access content and what they might do with it. And once you start making those exceptions, where do they end? If the Internet Archive can be blocked because AI companies might use it, what about research databases? What about accessibility tools that help visually impaired users? What about the next technology we haven't invented yet?
This is a real concern. People say "oh well, blocking machines is different from blocking humans," but that's exactly why I mention assistive tech for the visually impaired. Machines accessing content are frequently tools that help humans, including me. I use an AI tool to help fact check my articles, and part of that process involves feeding it the source links. But increasingly, the tool tells me it can't access those articles to verify whether my coverage accurately reflects them.
I don't have a clean answer here. Publishers genuinely need to find sustainable business models, and watching their work get ingested by AI systems without compensation is a legitimate grievance, especially when you see how much traffic some of these (usually less scrupulous) crawlers dump on sites. But the solution can't be to break the historical record of the internet. It can't be to ensure that our most trusted sources of information are the ones that disappear from archives while the least trustworthy ones remain.
We need to find ways to address AI training concerns that don't require us to abandon the principle of an open, preservable web. Because right now, we're building a future where historians, researchers, and citizens can't access the journalism that documented our era. And that's not a tradeoff any of us should be comfortable with.
Note to copyright alerters: It is Techdirt policy to always provide most of its content, including articles and podcasts, to anyone for free. It never sues for copyright infringement.
calimary (89,386 posts)
Thanks, ancianita!
