Since both of those source IPs belong to Alibaba (I spend a lot of time on
whois.arin.net and the other regional registries), those two at least are
fake. I've seen a lot of obviously fake user agent strings and referrer
URLs (which I think is where the https://www.google.com/ in those log
lines comes from). I've also seen a lot of presumably hacked residential
and business equipment used in botnets. Those usually make a single search
or record retrieval request per IP, and then another IP follows up with a
different request (and never, ever, any JS, CSS, or images), which limits
what geoblocking can accomplish. I assume these are related to the
"third party scrapers" that Anthropic (or whoever it was) alluded to a
long time ago when they explained why they didn't respect robots.txt, and
to the wild-west scraping that everyone with a GeForce and a dream is
taking part in before the bubble bursts.
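For what it's worth, that one-request-per-IP, no-assets pattern is fairly
easy to pull out of an Apache access log. A rough sketch (the inline
sample log, file name, and asset extension list are just illustrative;
point it at your real log):

```shell
# Build a tiny sample combined-format log; substitute your real access.log.
cat > sample.log <<'EOF'
1.2.3.4 - - [16/Jun/2025:00:00:09 -0700] "GET /eg/opac/record/1 HTTP/1.0" 200 100 "-" "-"
5.6.7.8 - - [16/Jun/2025:00:00:10 -0700] "GET /eg/opac/record/2 HTTP/1.0" 200 100 "-" "-"
9.9.9.9 - - [16/Jun/2025:00:00:11 -0700] "GET /opac/home HTTP/1.1" 200 100 "-" "-"
9.9.9.9 - - [16/Jun/2025:00:00:12 -0700] "GET /css/style.css HTTP/1.1" 200 100 "-" "-"
EOF

# IPs that made exactly one request and never touched JS/CSS/images --
# the one-shot botnet signature described above.
suspects=$(awk '{ hits[$1]++
                  if ($7 ~ /\.(js|css|png|jpg|jpeg|gif|ico|svg)(\?|$)/) asset[$1] = 1 }
            END { for (ip in hits) if (hits[ip] == 1 && !(ip in asset)) print ip }' \
           sample.log | sort)
echo "$suspects"
```

From there you can feed the list to whois (or your blocklist), though as
noted above the IPs rotate fast enough that it's a losing game on its own.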

All that to say that blocking them is fairly hard without going full
Cloudflare (or similar). One thing we've put together here is this LP:
https://bugs.launchpad.net/evergreen/+bug/2113979. It will usually just
throw a 302 at a bot, and because bots aren't actual browsers they just
sort of run out of steam, while human users may be redirected once per
session or likely not at all. I complained a lot more about things in that
ticket so I won't rehash it all here, but you may be able to lower your
resource use and spend more time serving real users by trying out that
patch.
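I won't paste the patch itself here, but the general shape of that kind of
redirect challenge, as a purely illustrative Apache sketch (the cookie
name eg_seen and the domain are made up, and this is NOT the code from
that LP), looks something like:

```apache
# Illustrative only -- not the actual LP #2113979 patch. Clients without
# the (made-up) eg_seen cookie get one 302 back to the same URL, which
# also sets the cookie; a real browser follows the redirect carrying the
# cookie and proceeds, while a cookie-less bot never gets past the 302.
RewriteEngine On
RewriteCond %{HTTP_COOKIE} !eg_seen=1
RewriteRule ^/eg/opac/ %{REQUEST_URI} [CO=eg_seen:1:.example.org,R=302,L]
```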

As for your cover 404s: as long as you're not blocking anything from
internal ranges, and aren't blocking the outgoing connections your system
needs to reach a cover provider, those are probably just fine. One thing
to note: I don't know who you use for cover images, but OpenLibrary has
lowered their image request limits so much that we really should remove
them as a provider. Unless you contact them directly, there's a limit of
10 image retrievals in X time (I don't recall offhand; maybe 1 or more
hours?), and because cover image retrievals are run through the server,
one person loading a search results page will blow through the limit
immediately.
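If you want to rule out the outbound-blocking case, a quick check from the
app server itself is something like this (covers.openlibrary.org and the
ISBN are just example values; substitute whichever provider you use):

```shell
# Fetch one small cover image and report only the HTTP status code.
# 000 means the connection itself failed (DNS, firewall, timeout, etc.);
# anything in the 2xx/3xx/4xx range means outbound traffic is getting out.
code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 \
    https://covers.openlibrary.org/b/isbn/9780385533225-S.jpg || true)
echo "cover provider answered: HTTP $code"
```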

Jason

-- 
Jason Boyer
Senior System Administrator
Equinox Open Library Initiative
[email protected]
+1 (877) Open-ILS (673-6457)
https://equinoxOLI.org/


On Mon, Jun 16, 2025 at 12:13 PM JonGeorg SageLibrary via Evergreen-general
<[email protected]> wrote:

> One thing I am seeing a ton of is google.com entries rather than GoogleBot
>
> our_domain:443 47.79.206.79 - - [16/Jun/2025:00:00:09 -0700] "GET
> /eg/opac/record/2620408?query=Fathers%20Juvenile%20fiction HTTP/1.0" 500
> 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile
> Safari/537.36"
> our_domain:443 47.79.206.22 - - [16/Jun/2025:00:00:08 -0700] "GET
> /eg/opac/record/2621426?query=Allingham%20William%201824%201889 HTTP/1.0"
> 500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Mobile
> Safari/537.36"
>
> Do you think those are legitimate patron searches or more likely Google
> scraping in a different way?
> -Jon
>
> On Mon, Jun 16, 2025 at 8:44 AM JonGeorg SageLibrary <
> [email protected]> wrote:
>
>> But that many? I just tried to reboot the app server and it froze on the
>> advanced key value. I'm wondering if it's unrelated and, like you said,
>> normal, and instead the Docker container managing the SSL cert is locked or
>> something similar. I've reached out to the people hosting the servers to see if they
>> have any insight. Thank you!
>> -Jon
>>
>> On Mon, Jun 16, 2025 at 8:41 AM Bill Erickson <[email protected]> wrote:
>>
>>> Hi Jon,
>>>
>>> Those would be the patron catalog performing added content lookups.
>>> Instead of directly reaching out to the vendor for the data, it leverages
>>> the existing web api via internal requests (in asynchronous batches) to
>>> collect the data.  Those are expected.
>>>
>>> -b
>>>
>>>
>>>
>>> On Mon, Jun 16, 2025 at 11:20 AM JonGeorg SageLibrary via
>>> Evergreen-general <[email protected]> wrote:
>>>
>>>> Greetings.
>>>> We've been slammed by bot traffic and had to take countermeasures. We
>>>> geoblocked international traffic at the host firewall level, and recently
>>>> added an nginx bot blocker for bots based on servers in the US and Canada. I
>>>> then scraped bot IPs out of the Apache logs and began adding the IPs that
>>>> were still coming through. Yes, I've updated the robots.txt file; they're
>>>> ignoring it.
>>>>
>>>> The issue is that after a day or two of reprieve, we started getting a
>>>> ton of 404's with loopback addresses. I've reverted the blacklist config
>>>> file back to blank, and restarted all services on all servers. We're still
>>>> getting a ton of traffic that appears to be internally generated.
>>>>
>>>> I don't see anything obvious within crontab. Since it appears to be
>>>> internally generated, the opac stays up longer than it normally would with
>>>> the number of sessions on the load balancer.
>>>>
>>>> Is there an Evergreen or Apache service that indexes the entire
>>>> catalog? We have our external IP whitelisted. Do internal VLAN IP addresses
>>>> need to be whitelisted?
>>>>
>>>> Here's an example of the traffic I'm seeing. It's all on port 80, too;
>>>> external traffic all comes in on 443.
>>>>
>>>> our_domain:80 127.0.0.1 - - [16/Jun/2025:08:18:31 -0700] "HEAD
>>>> /opac/extras/ac/anotes/html/r/2621889 HTTP/1.1" 404 159 "-" "-"
>>>>
>>>> -Jon
>>>>
>>>> _______________________________________________
>>>> Evergreen-general mailing list --
>>>> [email protected]
>>>> To unsubscribe send an email to
>>>> [email protected]
>>>>