On Thu, 3 Jul 2025 11:59:23 +0200 Christof Meerwald <cme...@cmeerw.org> wrote:
> > (1) They have no effective rate limiting mechanism on the origin side.
> > (2) They are intentionally distributing requests to avoid server side
> >     rate limits.
> > (3) The combination of the two makes most caching useless.
> > (4) They (intentionally or maliciously) do not honor robots.txt.
> > (5) They are intentionally faking the user agent.
>
> I have heard these claims a few times, but don't think I have seen any
> more in-depth analysis about these - do you happen to have a link with
> a more detailed analysis?
>
> Personally, I am seeing gptbot crawling at a rate of up to about 1
> request per second. On the other hand, I have seen Scrapy-based
> crawlers hitting my web sites at full speed over multiple concurrent
> connections, but I am not sure these are connected to the AI scrapers.

I've been operating an anti-scraper tarpit since mid-2024. In the past
hour it has seen 32269 hits from 9582 addresses, presenting 1466
User-Agents (i.e., randomized). A new IP will show up, hit one link
deep in the tarpit, and disappear for weeks.

These numbers are fairly typical load, but I've documented surges of
2500+ new TCP connections per second; I typically end up banning an
entire /16 or two to recover my VM when that happens. Other operators
I've been in touch with report similar behaviour.

Whatever the cause, it is definitely true that something is targeting
websites, especially those with lots of source code and/or images like
photography blogs, with overwhelming distributed force.

-- 
Aaron B. <aa...@zadzmo.org>
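
For anyone who wants to pull similar numbers out of their own logs, the
sketch below tallies the last hour of a web server access log: total
hits, unique client IPs, unique User-Agents, and the /16 networks with
the most traffic (the same granularity as the bans mentioned above).
The log path, the combined log format, and the one-hour window are
assumptions, not anything from the post above; adjust them for your own
setup.

    #!/usr/bin/env python3
    # Rough tally of the last hour of a combined-format access log:
    # hits, unique client IPs, unique User-Agents, and the /16 networks
    # responsible for the most requests. Log path and format are
    # assumptions; adjust both for your own server.
    import re
    import sys
    import ipaddress
    from collections import Counter
    from datetime import datetime, timedelta, timezone

    LOG = sys.argv[1] if len(sys.argv) > 1 else "access.log"

    # Typical combined log line:
    #   IP - - [timestamp] "request" status size "referer" "user-agent"
    LINE = re.compile(
        r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" \d+ \S+ '
        r'"[^"]*" "(?P<ua>[^"]*)"'
    )

    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)

    hits = 0
    ips = Counter()
    uas = set()
    nets = Counter()

    with open(LOG, errors="replace") as fh:
        for line in fh:
            m = LINE.match(line)
            if not m:
                continue
            # Apache/nginx timestamp, e.g. "03/Jul/2025:11:59:23 +0200"
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            if ts < cutoff:
                continue
            hits += 1
            ip = m.group("ip")
            ips[ip] += 1
            uas.add(m.group("ua"))
            try:
                if ipaddress.ip_address(ip).version == 4:
                    nets[ipaddress.ip_network(f"{ip}/16", strict=False)] += 1
            except ValueError:
                pass

    print(f"{hits} hits from {len(ips)} addresses, "
          f"{len(uas)} distinct User-Agents")
    print("busiest /16 networks:")
    for net, n in nets.most_common(5):
        print(f"  {net}  {n} hits")

Run it as "python3 tally.py /path/to/access.log"; the top /16 output is
only a starting point for deciding whether a block that wide is worth
the collateral damage.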