On 7/3/25 11:59 AM, Christof Meerwald wrote:
On Thu, Jul 03, 2025 at 11:30:48AM +0200, Jörg Sonnenberger wrote:
On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
Can you really blame kids for looking at all 5000 links from a single
file, when you give them 5000 links to start with?  Maybe start by not
giving the 5000 unique links from a single file, and implement caching
/ throttling?  How could you know there's nothing interesting in there
if you don't visit it all for a few files first?

Are you intentionally misrepresenting the problem?

These AIs literally behave the exact same way as humans; they're
simply dumber and more persistent.  The way CVSweb is designed, it's
easily DoS'able with the default recursive `wget -r` (`--recursive`) from
probably like 20 years ago?

This is complete BS. "wget -r" uses a single connection (at any point in
time). It uses a consistent source address. It actually honors robots.txt by
default. None of that applies to the current generation of AI scrapers:

(1) They have no effective rate limiting mechanism on the origin side.
(2) They are intentionally distributing requests to avoid server-side rate
limits.
(3) The combination of the two makes most caching useless.
(4) They (intentionally or maliciously) do not honor robots.txt.
(5) They are intentionally faking the user agent.
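
To make (1)-(3) concrete with a toy sketch in Python - made-up numbers and
addresses, nothing measured on any real server: a per-address token bucket
shuts down a single greedy client quickly, but the same request volume spread
over many source addresses passes untouched.

    import time
    from collections import defaultdict

    # Toy per-address token bucket: every source IP may do RATE requests
    # per second with a burst allowance of BURST.  Numbers are made up.
    RATE, BURST = 2.0, 10.0
    buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow(ip):
        b = buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False

    # One greedy client gets cut off once its burst allowance is gone ...
    print(sum(allow("203.0.113.7") for _ in range(1000)))

    # ... but the same 1000 requests spread over 1000 addresses all pass,
    # even though the load on the server is exactly the same.
    print(sum(allow("10.0.%d.%d" % (i // 256, i % 256)) for i in range(1000)))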

I have heard these claims a few times, but I don't think I have seen any
more in-depth analysis of them - do you happen to have a link to a more
detailed analysis?

Not easily. There is a lot of documentation by the AI scammers themselves on how they try to avoid anti-scraping measures.

Personally, I am seeing gptbot crawling at a rate of up to about 1
request per second. On the other hand, I have seen Scrapy-based
crawlers hitting my web sites at full speed over multiple concurrent
connections, but I am not sure these are connected to the AI scrapers.
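
For reference, rates like that can be pulled out of an ordinary access log.
A rough sketch in Python, assuming the common combined log format and a
hypothetical log path (adjust both to the actual setup):

    import re
    from collections import Counter
    from datetime import datetime

    # Count requests per user agent and per minute from a combined-format
    # access log.  The path and the regex are assumptions, not specific to
    # any site mentioned above.
    LOG = "/var/log/nginx/access.log"
    pat = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"$')

    per_minute = Counter()
    with open(LOG, errors="replace") as f:
        for line in f:
            m = pat.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            per_minute[(m.group("ua"), ts.strftime("%Y-%m-%d %H:%M"))] += 1

    # Peak requests per minute for each user agent string.
    peaks = Counter()
    for (ua, _minute), n in per_minute.items():
        peaks[ua] = max(peaks[ua], n)
    for ua, n in peaks.most_common(10):
        print("%6d/min  %s" % (n, ua[:70]))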

My own web sites see moderate scraper traffic, but they don't have a large site graph either. On various other sites like anonhg.n.o, the main Mercurial bug tracker etc., we have been observing persistent high load. The worst offenders are those using distributed processing, e.g. dozens of IP ranges and coordinated scans that don't hit the same page twice. That's why "Use caching" is such an ignorant suggestion.
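
To spell the caching point out with a toy model (Python, invented numbers):
ordinary traffic that keeps returning to a few popular pages is served almost
entirely from cache, while a coordinated crawl that fetches every distinct
page exactly once gets no cache hits at all - every request lands on the
expensive backend.

    import random

    # Toy model with invented numbers: a large set of distinct, expensive-
    # to-render pages (think cvsweb diff pages) behind a cache.
    N = 500_000

    def hit_rate(urls):
        seen, hits = set(), 0
        for u in urls:
            if u in seen:
                hits += 1          # served from cache
            else:
                seen.add(u)        # rendered by the backend, then cached
        return hits / N

    # Ordinary traffic keeps hitting a small set of popular pages ...
    popular = [random.randrange(1000) for _ in range(N)]
    print("popular pages:    %.1f%% cache hits" % (100 * hit_rate(popular)))

    # ... while a coordinated crawl fetches every distinct page exactly once,
    # so every single request goes through to the backend.
    print("exhaustive crawl: %.1f%% cache hits" % (100 * hit_rate(range(N))))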

Joerg
