Hi Alex,

On Fri, Apr 11, 2025 at 01:41:10AM +0200, Aleksandar Lazic wrote:
> Hi.
>
> I thought to contribute to the
> https://github.com/ai-robots-txt/ai.robots.txt repo and would like to hear
> your opinion on the question.
>
> Should the AI crawler be
> [ ] tarpitted, or directly get a
> [ ] deny?
>
> Both lines are added in the example config to show that the config is
> almost the same on the HAProxy side.
>
> https://github.com/git001/ai.robots.txt/blob/7169417be76d8f6e8ca69593f626ca24814cf3a2/haproxy-ai-crawler-block.config#L36-L37
I think it's generally a bad idea to simply block requests based on
approximate matches, because there's always a risk of false positives that
will completely block access for valid visitors without giving them any way
to figure out what's wrong or to contact anyone.

IMHO a better approach is to slow the requests down significantly, or to
only block them past a certain number of requests proving that they're
abusing. For example, you could use track-sc on the user-agent if it matches
the suspicious ones, and block after a few tens of requests or if the
request rate is too high. This would at least let occasional visitors who
are unlucky enough to match one of the regexes pass through.

Another point to keep in mind is to never let someone else dictate to you
who's allowed to access your infrastructure. That's the problem that RBLs
pose, for example. Any successful pattern collection service should be used
for limiting instead of blocking, precisely to make sure its maintainers
don't go crazy, and so that even if they do, you're not keeping your
visitors at the door.

> I know that HAProxy Enterprise has a great bot-management solution
>
> https://www.haproxy.com/solutions/bot-management
>
> which is mentioned in these blog posts:
>
> https://www.haproxy.com/blog/how-to-reliably-block-ai-crawlers-using-haproxy-enterprise
> https://www.haproxy.com/blog/nearly-90-of-our-ai-crawler-traffic-is-from-tiktok-parent-bytedance-lessons-learned
>
> but maybe the ai.robots.txt can be a poor person's solution which can also
> be used for the HAProxy Ingress Controller?

Yes, possibly! Thanks for the link by the way! I noticed as well on
haproxy.org that the traffic increased by about 20-30% over the last year,
and when I got the time to perform some analysis, there were indeed 30-40%
of user-agents having the word "AI" or "GPT" in them... I might give that
one a try once I have some time.

Cheers,
Willy
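P.S. a minimal, untested sketch of the track-sc idea mentioned above, for
illustration only. The frontend/backend names, the regex file path, the
table sizes and the thresholds are all made up and would need tuning for a
real deployment:

```haproxy
frontend web
    bind :80

    # Track request counts/rates per User-Agent string over a sliding window.
    # Sizes and expiry are illustrative values, not recommendations.
    stick-table type string len 128 size 10k expire 5m \
        store http_req_cnt,http_req_rate(1m)

    # Hypothetical regex file containing the ai.robots.txt user-agent patterns.
    acl suspicious_ua req.hdr(User-Agent) -i -m reg -f /etc/haproxy/ai-crawlers.regex

    # Only track suspicious agents, so the table stays small and ordinary
    # visitors are never counted at all.
    http-request track-sc0 req.hdr(User-Agent) if suspicious_ua

    # Let occasional matches through; act only once abuse is proven, either
    # by total volume or by request rate (thresholds are arbitrary here).
    http-request tarpit if suspicious_ua { sc_http_req_rate(0) gt 20 }
    http-request deny deny_status 429 if suspicious_ua { sc_http_req_cnt(0) gt 50 }

    default_backend app
```

Tarpit for the rate case and deny for the volume case is just one possible
split; either action alone would also match the approach described above.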