Hi Alex,

On Fri, Apr 11, 2025 at 01:41:10AM +0200, Aleksandar Lazic wrote:
> Hi.
> 
> I thought about contributing to the
> https://github.com/ai-robots-txt/ai.robots.txt repo and would like to hear
> your opinion on the following question.
> 
> Should the AI crawler be
> [ ] tarpitted, or directly get a
> [ ] deny?
> 
> Both lines are added in the example config to show that the configuration is
> almost the same on the HAProxy side.
> 
> https://github.com/git001/ai.robots.txt/blob/7169417be76d8f6e8ca69593f626ca24814cf3a2/haproxy-ai-crawler-block.config#L36-L37
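
For reference, the two alternatives in such a config boil down to something
roughly like the lines below (a rough sketch, not the actual lines from your
repo; the ACL name and the pattern file path are made up):

    frontend www
        bind :80
        # hypothetical ACL matching a regex list of AI crawler user-agents
        acl ai_crawler req.hdr(user-agent) -i -m reg -f /etc/haproxy/ai-crawlers.lst

        # option 1: hold the request for "timeout tarpit" before returning an error
        http-request tarpit if ai_crawler

        # option 2: reject the request immediately
        http-request deny if ai_crawler

Only one of the two http-request rules would be kept in practice; with both
present, the tarpit rule is evaluated first.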

I think it's generally a bad idea to simply block requests based on
approximate matches, because there's always a risk of false positives
that completely block access for legitimate visitors without giving
them any way to figure out what's wrong or anyone to contact.

IMHO a better approach is to slow the requests down significantly,
or to only block them past a certain number of requests that proves
they're abusive. For example, you could use track-sc on the user-agent
if it matches one of the suspicious ones, and block after a few tens
of requests or if the request rate is too high; see the sketch below.
This would at least let occasional visitors who are unlucky enough to
match one of the regexes pass through.
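
Roughly something like this (an untested sketch just to illustrate the idea;
the table sizing, the thresholds and the pattern file path are made up):

    frontend www
        bind :80

        # per-user-agent counters: a string key so each distinct UA gets its own entry
        stick-table type string len 128 size 100k expire 10m store http_req_cnt,http_req_rate(1m)

        # hypothetical regex list of suspicious AI crawler user-agents
        acl ai_crawler req.hdr(user-agent) -i -m reg -f /etc/haproxy/ai-crawlers.lst

        # only start counting when the user-agent looks suspicious
        http-request track-sc0 req.hdr(user-agent) if ai_crawler

        # block only once the request budget is clearly exceeded
        http-request deny deny_status 429 if ai_crawler { sc_http_req_cnt(0) gt 50 }
        http-request deny deny_status 429 if ai_crawler { sc_http_req_rate(0) gt 20 }

        default_backend webservers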

Another point to keep in mind is to never let someone else dictate
who's allowed to access your infrastructure. That's the problem RBLs
pose, for example, and any successful pattern-collection service
should be used for limiting rather than blocking, precisely to make
sure they don't go crazy, and so that even if they do, you're not
keeping your visitors at the door.

> I know that HAProxy Enterprise has a great bot-management solution:
> https://www.haproxy.com/solutions/bot-management
> 
> which is mentioned in these blog posts:
> 
> https://www.haproxy.com/blog/how-to-reliably-block-ai-crawlers-using-haproxy-enterprise
> https://www.haproxy.com/blog/nearly-90-of-our-ai-crawler-traffic-is-from-tiktok-parent-bytedance-lessons-learned
> 
> but maybe ai.robots.txt can be a poor person's solution which can also be
> used with the HAProxy Ingress Controller?

Yes, possibly! Thanks for the link by the way! I noticed as well on
haproxy.org that the traffic increased by about 20-30% over the last
year, and when I found the time to do some analysis, indeed 30-40% of
user-agents had the word "AI" or "GPT" in them...

I might give that one a try once I have some time.

Cheers,
Willy

