On Thu, Jul 03, 2025 at 11:30:48AM +0200, Jörg Sonnenberger wrote:
> On 7/3/25 6:23 AM, Constantine A. Murenin wrote:
> > Can you really blame kids for looking at all 5000 links from a single
> > file, when you give them 5000 links to start with?  Maybe start by not
> > giving the 5000 unique links from a single file, and implement caching
> > / throttling?  How could you know there's nothing interesting in there
> > if you don't visit it all for a few files first?
> 
> Are you intentionally misrepresenting the problem?
> 
> > These AIs literally behave the exact same way as humans; they're
> > simply dumber and more persistent.  The way CVSweb is designed, it's
> > easily DoS'able with the default `wget -r` and `wget --recursive` from
> > probably like 20 years ago?
> 
> This is complete BS. "wget -r" uses a single connection (at any point in
> time). It uses a consistent source address. It actually honors robots.txt by
> default. None of that applies to the current generation of AI scrapers:
> 
> (1) They have no effective rate limiting mechanism on the origin side.
> (2) They are intentionally distributing requests to avoid server side rate
> limits.
> (3) The combination of the two makes most caching useless.
> (4) They (intentionally or maliciously) do not honor robots.txt.
> (5) They are intentionally faking the user agent.

I have heard these claims a few times, but I don't think I have seen
any in-depth analysis of them - do you happen to have a link to a more
detailed analysis?

Personally, I am seeing GPTBot crawling at a rate of up to about 1
request per second. On the other hand, I have seen Scrapy-based
crawlers hitting my web sites at full speed over multiple concurrent
connections, but I am not sure whether those are connected to the AI
scrapers.
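
For anyone who wants to check their own logs, here is a minimal sketch
that estimates per-user-agent request rates over the logged time span.
It assumes the standard combined access log format, and the log path is
just a placeholder:

#!/usr/bin/env python3
# Rough per-user-agent request rates from a combined-format access log.
# The log path below is just a placeholder - adjust for your setup.
import re
from collections import defaultdict
from datetime import datetime

LOG = "/var/log/nginx/access.log"

# combined format: ... [time] "request" status bytes "referer" "user-agent"
line_re = re.compile(r'\[([^\]]+)\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

hits = defaultdict(int)
first, last = {}, {}

with open(LOG, errors="replace") as f:
    for line in f:
        m = line_re.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
        ua = m.group(2)
        hits[ua] += 1
        first.setdefault(ua, ts)
        last[ua] = ts

# print the 20 busiest user agents with their average request rate
for ua, n in sorted(hits.items(), key=lambda kv: -kv[1])[:20]:
    span = max((last[ua] - first[ua]).total_seconds(), 1.0)
    print(f"{n:8d} req  {n / span:6.2f} req/s  {ua[:60]}")

An average over the whole window of course hides bursts, but it is
usually enough to spot which crawlers dominate.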


Christof

-- 
https://cmeerw.org                             sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org                   xmpp:cmeerw at cmeerw.org
