This is very useful. Thank you! We use ContentCafe for image retrieval. We're small enough that I highly doubt we can afford Cloudflare, which is why we're going this other route.

-Jon

On Mon, Jun 16, 2025 at 9:49 AM Jason Boyer <[email protected]> wrote:

> Since both of those source IPs are from Alibaba (I spend a lot of time on
> whois.arin.net and the other regional registrars) those two at least are
> fake. I've seen a lot of obviously fake user agent strings and referral
> urls (which I think is where https://google.com/ is in those urls). I've
> also seen a lot of presumably hacked residential and business equipment
> used in botnets which usually only make a single search or record retrieval
> request per IP and then another IP will follow up with a different request
> (and never, ever, any js, css, or images), which means there are limits to
> what geo blocking can be used for. I assume these would be related to the
> "third party scrapers" that Anthropic (or whoever) alluded to a long time
> ago when they explained why they didn't respect robots.txt and the wild
> west type of scraping that everyone with a GeForce and a dream are taking
> part in before the bubble bursts.
>
> All that to say that blocking them is fairly hard without going full
> Cloudflare (or similar). One thing we've put together here is this LP:
> https://bugs.launchpad.net/evergreen/+bug/2113979 which will usually just
> throw a 302 at a bot and because they aren't actual browsers they just sort
> of run out of steam while human users may be redirected a single time in a
> session or likely not at all. I complained a lot more about things in that
> ticket so I won't rehash all of that here, but you may be able to lower
> your resource use and spend more time serving real users by trying out that
> patch.
>
> As for your cover 404's, so long as you're not blocking anything from
> internal ranges and aren't blocking outgoing connections that would prevent
> your system from reaching a cover provider those are probably just fine.
> One thing to note, I don't know who you use for cover images, but
> OpenLibrary has lowered their image request limits so much that we really
> should remove them as a provider. Unless you contact them directly there's
> a limit of 10 image retrievals in X time (I don't recall off hand; maybe 1
> or more hours?) and because cover image retrievals are run through the
> server, 1 person loading a search results page will blow up the limit
> immediately.
>
> Jason
>
> --
> Jason Boyer
> Senior System Administrator
> Equinox Open Library Initiative
> [email protected]
> +1 (877) Open-ILS (673-6457)
> https://equinoxOLI.org/
>
>
> On Mon, Jun 16, 2025 at 12:13 PM JonGeorg SageLibrary via
> Evergreen-general <[email protected]> wrote:
>
>> One thing I am seeing a ton of is google.com entries rather than
>> GoogleBot
>>
>> our_domain:443 47.79.206.79 - - [16/Jun/2025:00:00:09 -0700] "GET
>> /eg/opac/record/2620408?query=Fathers%20Juvenile%20fiction HTTP/1.0" 500
>> 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile
>> Safari/537.36"
>> our_domain:443 47.79.206.22 - - [16/Jun/2025:00:00:08 -0700] "GET
>> /eg/opac/record/2621426?query=Allingham%20William%201824%201889 HTTP/1.0"
>> 500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K)
>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Mobile
>> Safari/537.36"
>>
>> Do you think those are legitimate patron searches or more likely Google
>> scraping in a different way?
>> -Jon
>>
>> On Mon, Jun 16, 2025 at 8:44 AM JonGeorg SageLibrary <
>> [email protected]> wrote:
>>
>>> But that many? I just tried to reboot the app server and it froze on the
>>> advanced key value. I'm wondering if it's unrelated and like you said
>>> normal, and instead the docker managing the SSL cert is locked or something
>>> similar. I've reached out to the people hosting the servers to see if they
>>> have any insight. Thank you!
>>> -Jon
>>>
>>> On Mon, Jun 16, 2025 at 8:41 AM Bill Erickson <[email protected]>
>>> wrote:
>>>
>>>> Hi Jon,
>>>>
>>>> Those would be the patron catalog performing added content lookups.
>>>> Instead of directly reaching out to the vendor for the data, it leverages
>>>> the existing web api via internal requests (in asynchronous batches) to
>>>> collect the data. Those are expected.
>>>>
>>>> -b
>>>>
>>>>
>>>>
>>>> On Mon, Jun 16, 2025 at 11:20 AM JonGeorg SageLibrary via
>>>> Evergreen-general <[email protected]> wrote:
>>>>
>>>>> Greetings.
>>>>> We've been slammed by bot traffic and had to take counter measures. We
>>>>> geoblocked international traffic at the host firewall level, and recently
>>>>> added a nginx bot blocker for bots based on servers in the US and Canada.
>>>>> I then scraped bot IPs out of the apache logs and began adding the IPs that
>>>>> were still coming through. Yes, I've updated the robots.txt file- they're
>>>>> ignoring it.
>>>>>
>>>>> The issue is that after a day or two of reprieve, we started getting a
>>>>> ton of 404's with loopback addresses. I've reverted the blacklist config
>>>>> file back to blank, and restarted all services on all servers. We're still
>>>>> getting a ton of traffic that appears to be internally generated.
>>>>>
>>>>> I don't see anything obvious within crontab. Since it appears to be
>>>>> internally generated, the opac stays up longer than it normally would with
>>>>> the number of sessions on the load balancer.
>>>>>
>>>>> Is there an Evergreen or Apache service that indexes the entire
>>>>> catalog? We have our external IP whitelisted. Do internal vlan IP
>>>>> addresses need whitelisted?
>>>>>
>>>>> Here's an example of the traffic I'm seeing. It's all on port 80 too,
>>>>> external traffic all comes on 443.
>>>>>
>>>>> our_domain:80 127.0.0.1 - - [16/Jun/2025:08:18:31 -0700] "HEAD
>>>>> /opac/extras/ac/anotes/html/r/2621889 HTTP/1.1" 404 159 "-" "-"
>>>>>
>>>>> -Jon
>>>>>
>>>>> _______________________________________________
>>>>> Evergreen-general mailing list --
>>>>> [email protected]
>>>>> To unsubscribe send an email to
>>>>> [email protected]
>>>>>
>>>> _______________________________________________
>> Evergreen-general mailing list --
>> [email protected]
>> To unsubscribe send an email to
>> [email protected]
>>
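For anyone who wants to separate the two kinds of traffic discussed in this thread, here is a rough Python sketch, not anything that ships with Evergreen. It assumes log lines in the same shape as the samples quoted above (vhost:port, client IP, timestamp, request, status, size, referer, user agent), with each entry on a single unwrapped line. The function names, the list of static file extensions, and the choice to treat only 127.0.0.1 and ::1 requests under /opac/extras/ac/ as internal added content lookups are all my own assumptions for illustration; adjust them if your internal requests come from a VLAN address instead.

```python
import re
from collections import defaultdict

# Pattern for the log shape quoted in this thread (an assumption about your LogFormat).
LOG_RE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical list of static asset extensions; extend as needed.
STATIC_EXT = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def is_internal_added_content(entry):
    """True for loopback requests to the added content path seen in the thread."""
    return entry["ip"] in ("127.0.0.1", "::1") and entry["path"].startswith(
        "/opac/extras/ac/"
    )


def summarize(lines):
    """Count internal added content requests and list IPs that never fetch assets."""
    internal_added_content = 0
    pages = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines in some other format
        if is_internal_added_content(entry):
            internal_added_content += 1
            continue
        path = entry["path"].split("?", 1)[0].lower()
        if path.endswith(STATIC_EXT):
            assets[entry["ip"]] += 1
        else:
            pages[entry["ip"]] += 1
    no_asset_ips = sorted(ip for ip in pages if assets.get(ip, 0) == 0)
    return {
        "internal_added_content": internal_added_content,
        "no_asset_ips": no_asset_ips,
    }


# The three sample entries from this thread, rejoined onto single lines.
SAMPLE = [
    'our_domain:443 47.79.206.79 - - [16/Jun/2025:00:00:09 -0700] '
    '"GET /eg/opac/record/2620408?query=Fathers%20Juvenile%20fiction HTTP/1.0" '
    '500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"',
    'our_domain:443 47.79.206.22 - - [16/Jun/2025:00:00:08 -0700] '
    '"GET /eg/opac/record/2621426?query=Allingham%20William%201824%201889 HTTP/1.0" '
    '500 21258 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Mobile Safari/537.36"',
    'our_domain:80 127.0.0.1 - - [16/Jun/2025:08:18:31 -0700] '
    '"HEAD /opac/extras/ac/anotes/html/r/2621889 HTTP/1.1" 404 159 "-" "-"',
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Since the botnets described above often make only one request per IP, a "no assets" list like this is a hint for further review rather than a safe block list; a real patron with cached assets could show up in it too.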
