On Thu, Jul 03, 2025 at 11:30:48AM +0200, Jörg Sonnenberger wrote:
> On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> > Can you really blame kids for looking at all 5000 links from a single
> > file, when you give them 5000 links to start with? Maybe start by not
> > giving the 5000 unique links from a single file, and implement caching
> > / throttling? How could you know there's nothing interesting in there
> > if you don't visit it all for a few files first?
> 
> Are you intentionally misrepresenting the problem?
> 
> > These AIs literally behave the exact same way as humans; they're
> > simply dumber and more persistent. The way CVSweb is designed, it's
> > easily DoS'able with the default `wget -r` and `wget --recursive` from
> > probably like 20 years ago?
> 
> This is complete BS. "wget -r" uses a single connection (at any point in
> time). It uses a consistent source address. It actually honors robots.txt
> by default. None of that applies to the current generation of AI scrapers:
> 
> (1) They have no effective rate limiting mechanism on the origin side.
> (2) They are intentionally distributing requests to avoid server-side
>     rate limits.
> (3) The combination of the two makes most caching useless.
> (4) They (intentionally or maliciously) do not honor robots.txt.
> (5) They are intentionally faking the user agent.
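For concreteness, the per-address limiting that (1) and (2) are about is
essentially a token bucket keyed on the client's source address - something
like the sketch below (Python, purely illustrative; the rate and burst
numbers are made up, and this is not what any of the servers in question
actually run):

# Purely illustrative per-source-address token bucket; RATE and BURST
# are assumed values, not anything taken from a real configuration.
import time
from collections import defaultdict

RATE = 1.0     # allowed requests per second, per source address (assumed)
BURST = 10.0   # bucket capacity (assumed)

buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow(src: str) -> bool:
    """Return True if a request from source address `src` is within its budget."""
    b = buckets[src]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
    b["ts"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False

# A crawler reusing one source address ("wget -r") is throttled after the burst ...
print(sum(allow("192.0.2.1") for _ in range(100)))       # ~10 of 100 get through
# ... but the same 100 requests spread over 100 source addresses all get through.
print(sum(allow(f"203.0.113.{i}") for i in range(100)))  # 100 of 100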
I have heard these claims a few times, but I don't think I have seen any
more in-depth analysis of them - do you happen to have a link to a more
detailed analysis?

Personally, I am seeing gptbot crawling at a rate of up to about 1 request
per second. On the other hand, I have seen Scrapy-based crawlers hitting my
web sites at full speed over multiple concurrent connections, but I am not
sure whether these are connected to the AI scrapers.

Christof

-- 
https://cmeerw.org
sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org
xmpp:cmeerw at cmeerw.org
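In case anyone wants to compare numbers: rates like the ones above can be
estimated with nothing fancier than counting requests per user agent in the
access log. A rough sketch (combined log format assumed; the log path and
regex are placeholders, and this yields a long-run average rather than a
peak rate):

# Rough per-user-agent request-rate estimate from a combined-format access log.
# The log path and regex are assumptions -- adjust for the server in question.
import re
from collections import defaultdict
from datetime import datetime

LOG = "/var/log/nginx/access.log"   # example path
LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

first, last, hits = {}, {}, defaultdict(int)
with open(LOG, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        ua = m.group("ua")
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hits[ua] += 1
        first.setdefault(ua, ts)
        last[ua] = ts

# Print the 20 busiest user agents with their average request rate.
for ua, n in sorted(hits.items(), key=lambda kv: -kv[1])[:20]:
    span = (last[ua] - first[ua]).total_seconds() or 1.0
    print(f"{n:8d} req  {n / span:6.2f} req/s  {ua[:60]}")

Per-minute binning would be needed to see the "full speed" bursts rather
than an average over the whole log.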