Hi Willy.

Thank you for your answer; a more detailed reply from me below.

On Fri, 11 Apr 2025 19:45:45 +0200 Willy Tarreau <w...@1wt.eu> wrote:

> Hi Alex,
> 
> On Fri, Apr 11, 2025 at 01:41:10AM +0200, Aleksandar Lazic wrote:
> > Hi.
> > 
> > I thought about contributing to the
> > https://github.com/ai-robots-txt/ai.robots.txt repo and would like
> > to hear your opinion on the following question.
> > 
> > Should the AI crawlers
> > [ ] be tarpitted or
> > [ ] get a direct deny?
> > 
> > Both lines are included in the example config to show that the
> > config is almost the same on the HAProxy side.
> > 
> > https://github.com/git001/ai.robots.txt/blob/7169417be76d8f6e8ca69593f626ca24814cf3a2/haproxy-ai-crawler-block.config#L36-L37
> 
> I think it's generally a bad idea to simply block requests based on
> approximate matches, because there's always a risk of false positive
> that will completely block access to valid visitors without giving
> them a solution to figure out what's wrong or to contact anyone.
> 
> IMHO a better approach is to slow the requests down significantly,
> or to only block them past a certain number of requests proving that
> they're abusing. For example you could use track-sc on the user-agent
> if it matches the suspicious ones and block after a few tens of
> requests or if the request rate is too high. This would at least
> let occasional visitors who are unlucky enough to match one of
> the regexes pass through.
> 
> Also another point to keep in mind is to never let someone else
> dictate to you who's allowed to access your infrastructure. That's
> the problem that RBLs pose for example, and any successful pattern
> collection service should be used for limiting instead of blocking,
> precisely to make sure they don't go crazy and that even if they do,
> you're not keeping your visitors at the door.

I partly agree, because the "attackers", which we can also call "AI
crawlers", behave like "normal" users, which makes restrictions quite
difficult. From what I have read on the Internet, these tools ignore
the "robots.txt", which mostly lists User-Agents anyway. So the UA
string is, IMHO, as good a starting point as any other. There are other
defense strategies that block IP ranges or even whole ASNs, which could
also be a starting point, IMHO.
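
Just to make this concrete, below is a minimal, untested sketch of what
UA-based blocking could look like in HAProxy. It assumes the UA
substrings from the ai.robots.txt repo are kept in a flat file; the
path /etc/haproxy/ai-robots-ua.list and the backend name be_app are
just placeholders:

    frontend fe_main
        bind :80
        # one UA substring per line, generated from the ai.robots.txt repo
        acl is_ai_crawler req.hdr(User-Agent) -i -m sub -f /etc/haproxy/ai-robots-ua.list
        # the blunt variant: refuse the request outright
        http-request deny deny_status 403 if is_ai_crawler
        default_backend be_app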

I agree that there could be side effects which hit some "normal"
users; that's the reason why every site owner has to decide which
defense strategy fits their site best.

From your feedback, `tarpit` looks more suitable :-)
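
Combined with your track-sc idea, a rough, untested sketch could look
like this (the 50 requests/10min threshold, the tarpit timeout and the
file /etc/haproxy/ai-robots-ua.list are just assumptions for
illustration):

    backend st_ai
        # per-source request rate over the last 10 minutes
        stick-table type ip size 100k expire 30m store http_req_rate(10m)

    frontend fe_main
        bind :80
        timeout tarpit 10s
        acl is_ai_crawler req.hdr(User-Agent) -i -m sub -f /etc/haproxy/ai-robots-ua.list
        # only track sources whose UA matches the suspicious list
        http-request track-sc0 src table st_ai if is_ai_crawler
        acl ai_abuse sc0_http_req_rate(st_ai) gt 50
        # slow matching clients down only once the rate proves abuse
        http-request tarpit deny_status 429 if is_ai_crawler ai_abuse
        default_backend be_app

That way an occasional visitor who happens to match one of the patterns
still gets through, and only sustained crawling gets slowed down.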

Of course the response can also be a JS challenge which is handled by
the user's browser but not (yet) by the AI robots; that is similar to
what, for example, Anubis is doing. But I'm pretty sure that the AI
crawler companies will learn to handle JS, just as some load testing
tools did, and then the JS workaround won't be as effective as it is
today.
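
Just as an illustration, handing out such a challenge page from HAProxy
could look roughly like the lines below; the page
/etc/haproxy/challenge.html is hypothetical and would have to carry the
actual JS logic:

    acl is_ai_crawler req.hdr(User-Agent) -i -m sub -f /etc/haproxy/ai-robots-ua.list
    # hand suspected crawlers a JS challenge instead of the real content
    http-request return status 200 content-type "text/html" file /etc/haproxy/challenge.html if is_ai_crawler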

The best way would be for the AI companies to behave like good net
citizens, respect the "robots.txt" and use a unique UA for their
crawlers, but that would hurt their business; exactly because of that I
don't expect the AI companies to become good net citizens anytime
soon :-)

> > I know that HAProxy Enterprise have a great bot-management Solution
> > https://www.haproxy.com/solutions/bot-management
> > 
> > which is mentioned in these Blog Posts
> > 
> > https://www.haproxy.com/blog/how-to-reliably-block-ai-crawlers-using-haproxy-enterprise
> > https://www.haproxy.com/blog/nearly-90-of-our-ai-crawler-traffic-is-from-tiktok-parent-bytedance-lessons-learned
> > 
> > but maybe ai.robots.txt can be a poor man's solution which can
> > also be used for the HAProxy Ingress Controller?
> 
> Yes, possibly! Thanks for the link by the way! I noticed as well on
> haproxy.org that the traffic increased by about 20-30% over the last
> year, and when I got the time to perform some analysis, there was
> indeed 30-40% of user-agents having the "AI" or "GPT" word in them...
> 
> I might give that one a try once I have some time.

As I don't have any interesting sites out there, it would be nice to
see whether the regex changes anything for haproxy.org, of course only
if you find some time. I know 3.2 will be released soon, so no rush on
that topic.

> Cheers,
> Willy

Regards
Alex

