Humans Vs. Web Scrapers
Positive developments toward a sustainable web publishing future
Last week I wrote about a project I'm working on that ethically scrapes websites for music-related information. Using AI code assistants to create systems capable of massive information gathering and synthesis is remarkably easy. This highlighted to me how crucial proper licensing and content access methods are for protecting web publishing and human-generated content. Here are some recent developments in web scraping and how these innovations might reshape how LLMs interact with the rest of the digital world.
The story of one recent development actually starts decades ago before social media dominated the web.
RSS and what came after
In the 2000s, before social media funneled web publishing onto its platforms, users like me discovered content through RSS feeds and aggregated them with tools like Google Reader. I didn't know at the time that RSS stood for Really Simple Syndication. I never thought to Google it, probably because I was too busy enjoying articles in Google Reader.
Like many good things, RSS and Google Reader fell victim to social media. This happened in two ways. Within Google, the Reader team was reassigned to work on Google Plus, despite the fact that Google Plus never achieved the traffic levels that Google Reader enjoyed. Across the industry, social media platforms like Facebook lured news sites with promises of higher engagement and click-through traffic. With algorithmic feeds presenting users with content supposedly tailored to their preferences, what could go wrong?
In the decades since this migration to social platforms, digital publishing has struggled to survive. Now, LLMs scraping content without driving traffic to original sources may deliver the final blow to web publishing as a viable livelihood for writers online.
In a delightful twist of irony, a glimmer of hope for web publishing's future comes from one of RSS's original creators.
Really Simple Licensing
Created by the RSL Collective, the RSL standard aims to be an enhancement to robots.txt that standardizes how publishers license their content. This machine-readable license format clearly defines required attribution and payment terms for both crawling and inference.
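To make that concrete, here's a rough sketch of how a crawler might discover a site's license terms. I'm assuming an RSL-style setup where robots.txt carries a License directive pointing at the machine-readable terms; the directive name and the parsing below are my reading of the idea, not the finalized spec.

```python
import urllib.request


def find_license_urls(site: str) -> list[str]:
    """Pull machine-readable license pointers out of a site's robots.txt.

    Assumes an RSL-style "License:" directive pointing at the license
    document; the directive name is my assumption, not the final spec.
    """
    with urllib.request.urlopen(f"{site}/robots.txt", timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")

    urls = []
    for line in body.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    # Hypothetical usage: check a publisher's terms before crawling.
    print(find_license_urls("https://example.com"))
```

A crawler that finds terms it can't honor (attribution it won't give, fees it won't pay) should simply skip the site; that's the whole point of making the license machine-readable.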
One of the group's leaders is RSS co-creator Eckart Walther. Though I haven't found interviews where Walther connects his past work on RSS to RSL, I appreciate that someone who once created a standard making online writing accessible (and in retrospect, sustainable) is now working to make news articles and creative writing sustainable again.
LLMs need to be knowledgeable and current to provide relevant answers to users. This requires training on vast amounts of online information and access to web scraping tools. The explosion of LLM usage works well for everyone except the human writers whose work forms the foundation these LLMs rely on.
The current system isn't sustainable. If writers can't earn from their work, the situation will worsen beyond the damage already caused by decades of algorithmic chaos on social media. We'll likely see more lawsuits like the one Anthropic recently settled, agreeing to pay authors roughly $3,000 per work across hundreds of thousands of books.
RSL offers an escape from this unsustainable future by standardizing how content creators tell AI companies and scrapers what content can be used and under what conditions. This could create consistent revenue streams to stabilize and nurture web publishing. We need this to work. Web publishing represents writing at its most vital and urgent, and our culture suffers when viewpoints diminish online and reliable news sources dwindle.
Whatever success my project achieves will come largely from material sourced online through APIs and under the rules publishers set in their robots.txt files. Building my music discovery database has shown me just how powerful a smart script and a well-constructed database can be when unleashed on the internet.
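For what it's worth, the polite version of this isn't complicated. Here's a minimal sketch of the kind of check my scraper runs before touching a page, using Python's standard-library robots.txt parser; the user agent string is just a placeholder.

```python
import time
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "music-discovery-bot/0.1"  # placeholder name, not my real crawler


def polite_fetch(url: str) -> bytes | None:
    """Fetch a page only if the site's robots.txt allows it for this user agent."""
    parts = urllib.parse.urlsplit(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()

    if not robots.can_fetch(USER_AGENT, url):
        return None  # the publisher has opted this path out of crawling

    # Honor an explicit crawl-delay if the publisher set one.
    delay = robots.crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)

    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.read()
```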
Cloudflare’s strategy
RSL is just one piece of the puzzle being assembled to protect writers and content owners against unlawful or impolite scraping. Since Cloudflare sits in front of roughly 20% of all websites, any strategy they deploy will likely create ripple effects throughout the industry.
Cloudflare's strategy begins with blocking web scrapers by default unless content owners explicitly allow scraping. They also empower content owners to define specific scraper access rules: which parts of the site are accessible and under what conditions. Additionally, they're exploring a "pay per crawl" model. If Cloudflare or another company successfully productizes this approach, it would enable new business models for companies that cultivate and sell data to AI companies for training. While this is already happening to some extent, easily accessible, widely adopted tooling would significantly lower entry barriers for startups.
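Here's one way I could imagine the crawler side of pay-per-crawl looking, assuming the gate is signaled with an HTTP 402 Payment Required response. The price header name below is purely illustrative; it isn't Cloudflare's actual API.

```python
import urllib.request
from urllib.error import HTTPError


def fetch_or_quote(url: str) -> bytes | str:
    """Fetch a page, or report the publisher's asking price if crawling is gated."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "music-discovery-bot/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return resp.read()
    except HTTPError as err:
        if err.code == 402:  # Payment Required: the crawl is gated
            # "x-crawl-price" is an invented header name for illustration only.
            price = err.headers.get("x-crawl-price", "unspecified")
            return f"crawl gated: publisher asks {price} for this URL"
        raise
```

The interesting business question is what happens after the quote: a marketplace where crawlers pre-negotiate rates would turn today's ad hoc blocking into an actual revenue channel for publishers.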
While the exact outcome remains uncertain, it's encouraging to see many in the tech industry working toward a sustainable future for web publishing.
Preserving human creativity online
With 700 million weekly ChatGPT users, adoption is both widespread and rapidly increasing. Yet for authors, filmmakers, musicians, and other creative professionals, generative AI often represents industrial-scale appropriation of their work.
These concerns are valid. AI companies have acknowledged in court documents that they haven't always adhered to copyright laws. Anthropic's recent $1.5 billion settlement with authors demonstrates that courts won't simply allow AI companies to disregard copyright protections in their pursuit of artificial general intelligence.
Practical, standardized licensing frameworks like RSL, paired with enforcement tools such as those Cloudflare is developing, can create a more sustainable and fair ecosystem for creatives, AI companies, and their users. Whether these measures will be sufficient to preserve web content creation remains uncertain, but it's encouraging that some in the tech industry are making sincere efforts to find equilibrium.
Fair compensation for creators is essential for reducing tensions between creative communities and AI. Another crucial element is transforming how online content is presented to clearly differentiate between human and AI-generated material. We may need entirely new design approaches for news readers and content platforms that intuitively signal whether users are consuming human or machine-created content.
