On 04/07/2025 10:19, Michael J Gruber wrote:
Jelle van der Waa venit, vidit, dixit 2025-07-04 10:04:42:
Hi,
On 03/07/2025 22:18, Kevin Kofler via devel wrote:
Leigh Scott wrote:
Why isn't Fedora infra using Anubis to block LLM scrapers?
Why should they? Anubis is a scourge that wastes massive energy for all
legitimate browsers, breaks search engines, and if configured in a
particularly aggressive way as on the GNOME GitLab, even entirely locks out
some browsers (though that is an issue with the setup at GNOME
specifically).
I just want to point out that this is completely false. Anubis does not
break search engines; they are allowlisted to go through without a
challenge. Only user agents with "Mozilla" in them are being "checked".
The "wasted" cycles are only incurred once per week (that's how long the
cookie is valid). And you didn't account for the massive energy wasted
by AI scrapers :)
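
To make the decision described above concrete, here is a minimal sketch in Python. It is not Anubis's actual code or configuration: the function name, the allowlist entries and the cookie handling are assumptions for illustration. Only the "Mozilla" check and the one-week cookie lifetime come from the description above.

```python
from typing import Optional

# One week, matching the cookie validity mentioned above.
COOKIE_LIFETIME_SECONDS = 7 * 24 * 60 * 60

# Hypothetical allowlist entries. Matching on a user agent substring is a
# simplification for this sketch, not a statement about how Anubis does it.
ALLOWLISTED_BOTS = ("Googlebot", "bingbot")


def needs_challenge(user_agent: str,
                    cookie_issued_at: Optional[float],
                    now: float) -> bool:
    """Return True if this request should be sent a challenge."""
    # Allowlisted search engines pass without a challenge. This check comes
    # first because their user agents usually contain "Mozilla" as well.
    if any(bot in user_agent for bot in ALLOWLISTED_BOTS):
        return False
    # Only user agents that claim to be a browser are checked.
    if "Mozilla" not in user_agent:
        return False
    # A cookie issued less than a week ago means the client already passed.
    if cookie_issued_at is not None:
        if now - cookie_issued_at < COOKIE_LIFETIME_SECONDS:
            return False
    return True
```

Under these assumptions, a regular browser pays the cost on its first visit and then not again until the cookie expires.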
I was wondering what other websites do. I mean, Fedora's are certainly
not the only ones being AI-scraped, and I hadn't heard of that being an
issue before. So there have to be practical solutions.
I don't know the exact details of Fedora's issues, but for Arch Linux the
two most impacted services are:
* The AUR. It hosts a cgit web interface, and AI crawlers crawl
**everything**: blames, diffs. cgit is just not very efficient and generates
git blames and diffs on the fly, which is very CPU intensive.
* The Arch wiki. Even though it is mostly cached and read-only for wiki
pages, it still succumbed to AI scrapers. MediaWiki also has some history
endpoints with diffs, which are likely very inefficient and not cacheable.
So a big issue is that these AI crawlers hit endpoints which are
expensive and crawl them all at once from hundreds of different IPs.
Furthermore, they don't seem to cache anything, so they just re-crawl
everything every X amount of time.
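
As a rough illustration of which requests fall into the "expensive" category described above, here is a small Python sketch that flags cgit blame/diff URLs and MediaWiki history/diff URLs. The function name and the example hostnames in the usage note are hypothetical, and the matching is deliberately crude (a path segment literally named "diff" would also match); it is not taken from any Arch Linux or Fedora tooling.

```python
from urllib.parse import parse_qs, urlsplit

# cgit views that are computed on the fly (assumed to appear as path segments).
CGIT_EXPENSIVE_SEGMENTS = {"blame", "diff"}

# MediaWiki query parameters that request a diff between revisions.
MEDIAWIKI_DIFF_PARAMS = ("diff", "oldid")


def is_expensive(url: str) -> bool:
    """Guess whether a URL hits an on-the-fly blame, diff or history view."""
    parts = urlsplit(url)
    segments = set(parts.path.strip("/").split("/"))
    if segments & CGIT_EXPENSIVE_SEGMENTS:
        return True
    query = parse_qs(parts.query)
    if query.get("action") == ["history"]:
        return True
    return any(param in query for param in MEDIAWIKI_DIFF_PARAMS)
```

For example, `is_expensive("https://wiki.example.org/index.php?title=Pacman&action=history")` is true, while a plain page view such as `https://wiki.example.org/title/Pacman` is not. A classifier like this could feed request logs or rate limits, but how to act on it is a separate decision.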