This is very useful. Thank you!

We use ContentCafe for image retrieval. We're small enough that I highly
doubt we can afford Cloudflare, which is why we're going this other route.
-Jon

On Mon, Jun 16, 2025 at 9:49 AM Jason Boyer <[email protected]> wrote:

> Since both of those source IPs are from Alibaba (I spend a lot of time on
> whois.arin.net and the other regional registries), those two at least are
> fake. I've seen a lot of obviously fake user agent strings and referral
> URLs (which I think is what the https://google.com/ in those entries is).
> I've also seen a lot of presumably hacked residential and business
> equipment used in botnets. Those usually make only a single search or
> record retrieval request per IP, then another IP follows up with a
> different request (and never, ever, any JS, CSS, or images), which limits
> what geo blocking can accomplish. I assume these are related to the
> "third party scrapers" that Anthropic (or whoever) alluded to a long time
> ago when they explained why they didn't respect robots.txt, and to the
> wild-west scraping that everyone with a GeForce and a dream is taking
> part in before the bubble bursts.
>
> All that to say that blocking them is fairly hard without going full
> Cloudflare (or similar). One thing we've put together here is this LP:
> https://bugs.launchpad.net/evergreen/+bug/2113979. It will usually just
> throw a 302 at a bot, and because bots aren't actual browsers they just
> sort of run out of steam, while human users may be redirected a single
> time in a session, or likely not at all. I complained a lot more about
> things in that ticket so I won't rehash all of it here, but you may be
> able to lower your resource use and spend more time serving real users
> by trying out that patch.
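The redirect-once idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch from the LP ticket, and the cookie name `eg_seen` is made up for the example:

```apache
# Sketch only: clients without our cookie get a single 302 back to the
# same URL that also sets the cookie. Real browsers follow the redirect
# once and pass on every later request; bots that ignore cookies never
# get past the redirect.
RewriteEngine On
RewriteCond %{HTTP_COOKIE} !eg_seen=1
RewriteRule ^/eg/opac/ %{REQUEST_URI} [R=302,CO=eg_seen:1:%{HTTP_HOST},L]
```

The design point is that the cost of one extra round trip is negligible for a human session but fatal for a crawler that makes exactly one stateless request per IP.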
>
> As for your cover 404s: as long as you're not blocking anything from
> internal ranges and aren't blocking outgoing connections that would
> prevent your system from reaching a cover provider, those are probably
> just fine. One thing to note: I don't know who you use for cover images,
> but OpenLibrary has lowered their image request limits so much that we
> really should remove them as a provider. Unless you contact them
> directly, there's a limit of 10 image retrievals in X time (I don't
> recall offhand; maybe 1 or more hours?), and because cover image
> retrievals are run through the server, one person loading a search
> results page will blow through the limit immediately.
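The arithmetic behind that last point is worth spelling out: because cover fetches are proxied through the server, every patron shares one source IP and therefore one quota. A back-of-the-envelope sketch (the page size is hypothetical, and the real OpenLibrary window length is unknown, as noted above):

```python
# With server-side cover retrieval, all patrons share one rate-limit bucket.
LIMIT_PER_WINDOW = 10   # from the thread: "10 image retrievals in X time"
RESULTS_PER_PAGE = 12   # hypothetical search-results page size

covers_requested = RESULTS_PER_PAGE  # one cover fetch per result row
blocked = max(0, covers_requested - LIMIT_PER_WINDOW)
print(f"{blocked} of {covers_requested} covers rejected")  # 2 of 12 here
```

So a single results page larger than the limit exhausts the window for every patron at once.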
>
> Jason
>
> --
> Jason Boyer
> Senior System Administrator
> Equinox Open Library Initiative
> [email protected]
> +1 (877) Open-ILS (673-6457)
> https://equinoxOLI.org/
>
>
> On Mon, Jun 16, 2025 at 12:13 PM JonGeorg SageLibrary via
> Evergreen-general <[email protected]> wrote:
>
>> One thing I am seeing a ton of is google.com referrer entries rather than
>> GoogleBot:
>>
>> our_domain:443 47.79.206.79 - - [16/Jun/2025:00:00:09 -0700] "GET
>> /eg/opac/record/2620408?query=Fathers%20Juvenile%20fiction HTTP/1.0" 500
>> 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile
>> Safari/537.36"
>> our_domain:443 47.79.206.22 - - [16/Jun/2025:00:00:08 -0700] "GET
>> /eg/opac/record/2621426?query=Allingham%20William%201824%201889 HTTP/1.0"
>> 500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Mobile
>> Safari/537.36"
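Log lines like the ones above can be split into fields with a small parser, which makes it easy to tally spoofed google.com referrers per source IP. A sketch, assuming the vhost-prefixed combined log format shown here (the sample line is shortened from the ones above):

```python
import re

# Parse vhost-prefixed combined-format Apache log lines so the referrer
# and user agent can be examined separately per source IP.
LOG_RE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('our_domain:443 47.79.206.79 - - [16/Jun/2025:00:00:09 -0700] '
        '"GET /eg/opac/record/2620408 HTTP/1.0" 500 21258 '
        '"https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K)"')

m = LOG_RE.match(line)
if m and "google.com" in m.group("referrer"):
    print(m.group("ip"), m.group("status"))  # 47.79.206.79 500
```

Note that a genuine Googlebot crawl can be verified by reverse DNS: its IPs resolve back to googlebot.com or google.com hostnames, which Alibaba-range addresses will not.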
>>
>> Do you think those are legitimate patron searches or more likely Google
>> scraping in a different way?
>> -Jon
>>
>> On Mon, Jun 16, 2025 at 8:44 AM JonGeorg SageLibrary <
>> [email protected]> wrote:
>>
>>> But that many? I just tried to reboot the app server and it froze on the
>>> advanced key value. I'm wondering if this is unrelated and, like you said,
>>> normal, and instead the Docker container managing the SSL cert is locked
>>> or something similar. I've reached out to the people hosting the servers
>>> to see if they have any insight. Thank you!
>>> -Jon
>>>
>>> On Mon, Jun 16, 2025 at 8:41 AM Bill Erickson <[email protected]>
>>> wrote:
>>>
>>>> Hi Jon,
>>>>
>>>> Those would be the patron catalog performing added-content lookups.
>>>> Instead of reaching out to the vendor directly for the data, it leverages
>>>> the existing web API via internal requests (in asynchronous batches) to
>>>> collect the data. Those are expected.
>>>>
>>>> -b
>>>>
>>>>
>>>>
>>>> On Mon, Jun 16, 2025 at 11:20 AM JonGeorg SageLibrary via
>>>> Evergreen-general <[email protected]> wrote:
>>>>
>>>>> Greetings.
>>>>> We've been slammed by bot traffic and have had to take countermeasures.
>>>>> We geoblocked international traffic at the host firewall level, and
>>>>> recently added an nginx bot blocker for bots based on servers in the US
>>>>> and Canada. I then scraped bot IPs out of the Apache logs and began
>>>>> adding the IPs that were still coming through. Yes, I've updated the
>>>>> robots.txt file; they're ignoring it.
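A user-agent-based nginx bot blocker of the kind mentioned above is typically built with a `map` block. This is only an illustrative sketch, not the actual configuration used; the bot patterns are examples, not a vetted list:

```nginx
# Sketch: flag known crawler user agents, then refuse them.
map $http_user_agent $bad_bot {
    default      0;
    ~*GPTBot     1;
    ~*Bytespider 1;
    ~*ClaudeBot  1;
}

server {
    listen 443 ssl;
    if ($bad_bot) { return 403; }
    # ... normal proxying to Evergreen follows ...
}
```

As the thread notes, though, this only catches bots that announce themselves; the ones spoofing browser user agents sail straight through.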
>>>>>
>>>>> The issue is that after a day or two of reprieve, we started getting a
>>>>> ton of 404's with loopback addresses. I've reverted the blacklist config
>>>>> file back to blank, and restarted all services on all servers. We're still
>>>>> getting a ton of traffic that appears to be internally generated.
>>>>>
>>>>> I don't see anything obvious in crontab. Since the traffic appears to
>>>>> be internally generated, the OPAC stays up longer than it normally would
>>>>> given the number of sessions on the load balancer.
>>>>>
>>>>> Is there an Evergreen or Apache service that indexes the entire
>>>>> catalog? We have our external IP whitelisted. Do internal VLAN IP
>>>>> addresses need to be whitelisted as well?
>>>>>
>>>>> Here's an example of the traffic I'm seeing. It's all on port 80, too;
>>>>> external traffic all comes in on 443.
>>>>>
>>>>> our_domain:80 127.0.0.1 - - [16/Jun/2025:08:18:31 -0700] "HEAD
>>>>> /opac/extras/ac/anotes/html/r/2621889 HTTP/1.1" 404 159 "-" "-"
>>>>>
>>>>> -Jon
>>>>>
>>>>> _______________________________________________
>>>>> Evergreen-general mailing list --
>>>>> [email protected]
>>>>> To unsubscribe send an email to
>>>>> [email protected]
>>>>>
>
