Identifying the crawlers is (almost) as simple as three ifs in a trenchcoat: 
https://chronicles.mad-scientist.club/tales/surviving-the-crawlers/#three-ifs-in-a-trenchcoat

Algernon / Gergely has had to tweak the detection used by iocaine powder 
recently, but he’s also released “iocaine powder as a service” for static 
websites. I haven’t hooked it up on my web presence yet because I have too many 
hobbies and not enough time, but my impression is that he would be open to 
helping if someone wanted to hook up iocaine to an existing website.

https://chronicles.mad-scientist.club/tales/only-junk-fans/ has the 
announcement.

As for filtering by IP address, I found on my test website that blocking AWS, 
Azure, DigitalOcean, and a few other "cloud providers" stopped most of the AI 
crawlers. The ones that got through after I blocked those were running an 
automated Chrome from residential IP addresses, and were even more aggressive 
about crawling than the ones coming from the cloud providers. These are nasty, 
because they are real browsers at real residential addresses, using some sort 
of browser extension to do the crawling unbeknownst to the human user.
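
If you are fronting the site with nginx (an assumption on my part), the 
cloud-provider blocking can be sketched with the geo module. The CIDRs below 
are placeholder TEST-NET ranges, not real provider lists -- you would 
substitute the ranges each provider publishes:

```nginx
# Goes in the http { } block. $cloud_block becomes 1 for any client
# whose address falls inside a listed range, 0 otherwise.
geo $cloud_block {
    default          0;
    192.0.2.0/24     1;   # placeholder range (substitute real AWS CIDRs)
    198.51.100.0/24  1;   # placeholder range (substitute real Azure CIDRs)
}

server {
    listen 80;
    # ...
    if ($cloud_block) {
        return 403;
    }
}
```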

-DaveP

On Fri, Mar 6, 2026, at 12:01, Peter G. wrote:
> On 04/03/2026 04:16, Nick Holland wrote:
>> cvsweb does not care about your browser.  It only cares about the IP
>> address.
>
> Maybe it should care, though.
>
> Simple user-agent filtering will already go far. I have several systems
> under heavy bot traffic, and most AI bots use either specific user-agent
> headers or broken/empty ones.
>
> Create a matching list of the most common browser user-agent headers, and
> match it against the traffic.
>
> On desktops, WebKit will lead the charge in that regard:
> Safari/Chrome/Opera will account for some 70-75% of the traffic, with
> Firefox following at 8-10%.
>
> On mobile, WebKit will show 85-90%, Firefox around 6%.
>
> ...which means you will only need several regex entries to handle almost
> all the legit traffic
>
> An nginx map with dynamic regexes will go far here; map has proven to be
> extremely high performing.
>
> Speaking from experience here, running a project dealing with 1000-25000
> RPS during calmer hours and up to 10k RPS during busy hours every day:
> user-agent whitelisting helps a lot.