You could skip mod_sec and do the detection directly with fail2ban's apache-badbots filter, by changing its failregex to the following (the spaces ARE important, so copy and paste it):

failregex = ^(?:\S+:\d+ )?<ADDR> [^"]*"[A-Z]+ [^"]+" \d+ \d+ "[^"]*" "[^"]*(?:<badbots>|<badbotscustom>)[^"]*"

then adding the bad bots to the start of the "badbots" regex, like this:

badbots = meta-externalagent|facebookexternalhit|SemrushBot|amazonbot|AmazonBot|ClaudeBot|claudebot|Atomic_Email_Hunter/4\.0| ... rest of the regex stays here.

and adding jails like these (the first watches the Apache access log, the second the Koha plack log):

[apache-badbots]
enabled = true
port     = http,https
filter   = apache-badbots
bantime  = 48h
logpath  = %(apache_access_log)s
maxretry = 1

[apache-badbots2]
enabled = true
port     = http,https
filter   = apache-badbots
bantime  = 48h
logpath  = /var/log/koha/USEYOURKOHASITENAMEHERE/plack.log
maxretry = 1
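
Before enabling the jails, it may help to sanity-check the pattern against a sample log line. The following is only a rough Python sketch, not how fail2ban itself works internally: fail2ban expands <ADDR>, <badbots> and <badbotscustom> on its own, so here they are replaced with simplified stand-ins. The IPv4-only address pattern, the "ExampleCustomBot" name, the sample log lines and the function names are all assumptions made for illustration.

```python
import re

# Template copied from the failregex above. <ADDR>, <badbots> and
# <badbotscustom> are placeholders that fail2ban normally fills in itself.
FAILREGEX = (
    r'^(?:\S+:\d+ )?<ADDR> [^"]*"[A-Z]+ [^"]+" \d+ \d+ '
    r'"[^"]*" "[^"]*(?:<badbots>|<badbotscustom>)[^"]*"'
)

# Assumption: a simplified IPv4-only stand-in for fail2ban's <ADDR> tag.
ADDR = r'(?P<host>\d{1,3}(?:\.\d{1,3}){3})'

# The bot names listed above; the elided rest of the stock list is left out.
BADBOTS = (
    r'meta-externalagent|facebookexternalhit|SemrushBot|amazonbot|AmazonBot'
    r'|ClaudeBot|claudebot|Atomic_Email_Hunter/4\.0'
)

# Assumption: a hypothetical placeholder for the stock badbotscustom list.
BADBOTSCUSTOM = r'ExampleCustomBot'


def build_pattern():
    """Expand the placeholders and compile the resulting regex."""
    expanded = (
        FAILREGEX.replace('<ADDR>', ADDR)
        .replace('<badbots>', BADBOTS)
        .replace('<badbotscustom>', BADBOTSCUSTOM)
    )
    return re.compile(expanded)


def banned_host(line):
    """Return the client address if the line matches, otherwise None."""
    match = build_pattern().search(line)
    return match.group('host') if match else None


# Made-up sample lines in Apache "combined" log format.
BOT_LINE = (
    '192.0.2.10 - - [25/Jul/2024:10:15:00 +0000] '
    '"GET /cgi-bin/koha/opac-search.pl?q=test HTTP/1.1" 200 5120 '
    '"-" "meta-externalagent/1.1"'
)
BROWSER_LINE = (
    '192.0.2.20 - - [25/Jul/2024:10:15:01 +0000] '
    '"GET /cgi-bin/koha/opac-main.pl HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"'
)

if __name__ == '__main__':
    print(banned_host(BOT_LINE))
    print(banned_host(BROWSER_LINE))
```

For checking the real filter against a real log, fail2ban ships its own fail2ban-regex tool, which takes the log file and the filter file as arguments; that is the more reliable test, since it uses fail2ban's actual tag expansion.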

On 7/25/24 10:15, Indranil Das Gupta wrote:
Hi Nigel,

My solution for that is a simple two-step process:

1) Use mod_sec to monitor and match the UA string of the incoming request
against a list of UAs I don't want, and return an HTTP 406 the first time
the UA matches.

2) Have fail2ban monitor the apache log for 406 and immediately ban the IP
(IPv4 / IPv6) for 96 hours using an apache-badbots jail.

This strategy has so far managed to keep my servers "cool".

cheers
-idg


On Thu, Jul 25, 2024, 16:57 Nigel Titley <ni...@titley.com> wrote:

Is anyone else getting problems with the facebook web crawler hammering
their OPAC search function?

This has been happening on and off for a couple of months but set in
with a vengeance a couple of days ago. The crawler is hitting us with
many OPAC search queries, beyond the capacity of our system to respond.

robots.txt is being ignored.

I started by blocking facebook's entire IPv6 range as the queries were
all coming in over IPv6. They responded by switching to IPv4 and because
they have a number of blocks it wasn't practical to block each and every
one of them.

I've temporarily switched off OPAC entirely and the system has returned
to normal and I can at least perform intranet functions but this is
obviously non-ideal.

Does anyone have any thoughts on this?

I'm running 22.05.13.000 on Ubuntu.

Thanks

Nigel
_______________________________________________

Koha mailing list  http://koha-community.org
Koha@lists.katipo.co.nz
Unsubscribe: https://lists.katipo.co.nz/mailman/listinfo/koha

--
Hector Gonzalez
ca...@genac.org