Crawler operators, please stop destroying the commons

Goodwill for crawling/scraping has rapidly been depleted
by Luna Nova

Badly edited "Silent Protector" meme showing Techaro Anubis defending the Gitea and Forgejo logos from an influx of logos associated with scraping. Anubis mascot by CELPHASE.

Wanted to grab a lot of data for your project or research? We had a tool for that called public datasets1.

"Hello I would like to GET /myFirstReactProject/commits/c33fe277 50 times from different IPs just in case it changed in the last 5 femtoseconds"

They have played us for absolute fools.
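The joke is only half a joke: a client that actually cared whether a page changed would send conditional requests instead of re-downloading it from fifty IPs. A minimal sketch of that idea in Python, assuming the requests library is installed; the user agent and contact address are placeholders, not anything real:

```python
# Minimal sketch of a conditional re-fetch: remember the server's validators
# (ETag / Last-Modified) and ask "has this changed?" instead of pulling the
# full page again every few seconds.
import requests

def fetch_if_changed(url: str, cache: dict) -> bytes | None:
    """Return new content, or None if the server says nothing changed (304)."""
    headers = {"User-Agent": "example-polite-fetcher/0.1 (contact@example.com)"}
    cached = cache.get(url)
    if cached:
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged; nothing to re-download
    resp.raise_for_status()

    # Store the validators so the next request can be conditional.
    cache[url] = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
    }
    return resp.content
```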

Stop writing new poorly made crawlers and taking down people's services2 through excessive crawling.
We do not need thousands of barely tested web crawlers grabbing the same pages, each a new opportunity for poorly thought-out or broken pacing to hit sites with denial-of-service levels of traffic.
Please please please just3 use Common Crawl data.
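For context, "just use Common Crawl data" can look roughly like this: ask the CDX index for captures of a URL, then pull only that record's byte range from the public WARC files. A rough sketch, assuming the requests and warcio libraries are installed; the crawl label CC-MAIN-2024-33 is only an example, pick a current crawl from commoncrawl.org:

```python
# Sketch: read a page out of Common Crawl instead of hitting the live site.
import io
import json
import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2024-33"  # example crawl label, not necessarily current

def latest_capture(url: str) -> dict | None:
    """Query the CDX index for captures of `url`; return the newest one, if any."""
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": url, "output": "json"},
        timeout=60,
    )
    if resp.status_code == 404:
        return None  # no captures of this URL in this crawl
    resp.raise_for_status()
    captures = [json.loads(line) for line in resp.text.splitlines()]
    return max(captures, key=lambda c: c["timestamp"]) if captures else None

def fetch_body(capture: dict) -> bytes:
    """Fetch just this capture's WARC record via an HTTP range request."""
    offset, length = int(capture["offset"]), int(capture["length"])
    resp = requests.get(
        f"https://data.commoncrawl.org/{capture['filename']}",
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=60,
    )
    resp.raise_for_status()
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            return record.content_stream().read()
    raise ValueError("no response record found in the fetched range")

capture = latest_capture("example.com/")
if capture:
    print(fetch_body(capture)[:200])
```

Because the index gives you the exact byte offset and length of each capture, you never have to touch the original site at all.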

We're rapidly heading towards the point where everyone is rightfully running anti-bot tech4 to keep their services up, because we couldn't do the sensible thing and use the massive open crawl dataset that's a community project. Instead, we all had to have our own Special Sauce at each company, because I guess that's more impressive than "yeah I downloaded Common Crawl".

There used to be goodwill towards crawlers, an understanding that we needed them to find content.
Oops, burned all the goodwill.

Nobody benefits from this new equilibrium.


1. Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2008.

3. "If your solution to some problem relies on “If everyone would just…” then you do not have a solution. Everyone is not going to just. At no time in the history of the universe has everyone just, and they’re not going to start now."

4. Anubis weighs the soul of your connection using a sha256 proof-of-work challenge in order to protect upstream resources from scraper bots.
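As a minimal illustration of the general technique behind a sha256 proof-of-work challenge (the generic idea, not Anubis's actual protocol): the client grinds through nonces until the hash of challenge plus nonce meets a difficulty target, and the server verifies the result with a single hash.

```python
# Toy proof-of-work: the client does the expensive search, the server does one hash.
import hashlib

def solve(challenge: str, difficulty: int) -> int:
    """Find a nonce such that sha256(challenge + nonce) starts with
    `difficulty` zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: a single hash comparison."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("example-challenge", difficulty=4)
assert verify("example-challenge", nonce, difficulty=4)
```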


Cite as BibTeX
@misc{crawlers-please-stop-destroying-the-commons,
    author = {Luna Nova},
    title = {Crawler operators, please stop destroying the commons},
    year = {2025},
    url = {https://lunnova.dev/articles/crawlers-please-stop-destroying-the-commons/},
    howpublished = {https://lunnova.dev/articles/crawlers-please-stop-destroying-the-commons/},
    urldate = {2025-04-19},
    note = {lunnova.dev - Goodwill for crawling/scraping has rapidly been depleted}
}

tagged rant