On Thu, 3 Jul 2025 11:59:23 +0200
Christof Meerwald <cme...@cmeerw.org> wrote:

> > (1) They have no effective rate limiting mechanism on the origin side.
> > (2) They are intentionally distributing requests to avoid server side rate
> > limits.
> > (3) The combination of the two makes most caching useless.
> > (4) They (intentionally or maliciously) do not honor robots.txt.
> > (5) They are intentionally faking the user agent.
> 
> I have heard these claims a few times, but don't think I have seen any
> more in-depth analysis about these - do you happen to have a link with
> a more detailed analysis?
> 
> Personally, I am seeing gptbot crawling at a rate of up to about 1
> request per second. On the other hand, I have seen Scrapy-based
> crawlers hitting my web sites at full speed over multiple concurrent
> connections, but I am not sure these are connected to the AI scrapers.

I've been operating an anti-scraper tarpit since mid-2024. In the past
hour it's seen 32,269 hits from 9,582 addresses, presenting 1,466
distinct User-Agents (i.e., randomized). A new IP will show up, hit one
link deep in the tarpit, and disappear for weeks.
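
For anyone curious what a tarpit looks like, here's a minimal Python
sketch of the concept. To be clear, this is not my actual code; the
link scheme, chunk size, port, and delay are all made up for
illustration:

    import hashlib
    import http.server
    import time

    def page_for(path):
        # Derive stable child links from a hash of the path, so the
        # maze looks consistent to a crawler but never ends.
        seed = hashlib.sha256(path.encode()).hexdigest()
        links = "".join(
            '<a href="/%s">%s</a><br>\n' % (seed[i:i+8], seed[i:i+8])
            for i in range(0, 40, 8)
        )
        return ("<html><body>%s</body></html>" % links).encode()

    class Tarpit(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            body = page_for(self.path)
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            # Drip the body out slowly so each bot request ties up
            # one of the crawler's connection slots for a while.
            for i in range(0, len(body), 16):
                self.wfile.write(body[i:i+16])
                self.wfile.flush()
                time.sleep(0.5)

        def log_message(self, *args):
            pass  # silence per-request logging

    if __name__ == "__main__":
        http.server.ThreadingHTTPServer(("", 8080), Tarpit).serve_forever()

Some tarpits also feed generated text to waste the scraper's
bandwidth, but the infinite link maze plus the slow drip is the core
of the idea.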

These numbers are a fairly typical load, but I've documented surges of
2,500+ new TCP connections per second; when that happens I typically
end up banning an entire /16 or two to recover my VM.
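
Picking which /16s to ban boils down to something like the following
(again a Python sketch, not my real tooling; the threshold, log path,
and the nftables set name are placeholders):

    import ipaddress
    from collections import Counter

    def worst_slash16s(ips, threshold=500):
        # Collapse each IPv4 address to its covering /16 and count
        # hits; anything over the threshold is a ban candidate.
        counts = Counter(
            ipaddress.ip_network(ip + "/16", strict=False)
            for ip in ips
            if ipaddress.ip_address(ip).version == 4
        )
        return [net for net, n in counts.most_common() if n >= threshold]

    if __name__ == "__main__":
        # Assume the first field of each access log line is the
        # client IP; emit nftables commands (assumes a set named
        # "banned" already exists in table inet filter).
        with open("access.log") as f:
            ips = [line.split()[0] for line in f if line.strip()]
        for net in worst_slash16s(ips):
            print("nft add element inet filter banned { %s }" % net)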

Other operators I've been in touch with report similar numbers.

Whatever the cause, it is definitely true that something is targeting
websites, especially those with lots of source code and/or images like
photography blogs, with overwhelming distributed force.

-- 
Aaron B. <aa...@zadzmo.org>
