JonGeorg,
If you're using nginx as a proxy, the problem is likely in how Apache
and nginx are configured together.
First, make sure that mod_remoteip is installed and enabled for Apache 2.
Then, in eg_vhost.conf, find the three lines that begin with
"RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
Next, check which header Apache reads for the remote IP address. In my
configuration it is "RemoteIPHeader X-Forwarded-For".
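Putting that together, the relevant pieces look something like this (a
sketch; the 127.0.0.1/24 range assumes nginx and Apache run on the same
host, and a2enmod is the Debian/Ubuntu way to enable the module):

    sudo a2enmod remoteip    # enable mod_remoteip on Debian/Ubuntu

    # in eg_vhost.conf:
    RemoteIPHeader X-Forwarded-For
    RemoteIPInternalProxy 127.0.0.1/24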
Next, make sure that the following two lines appear in BOTH "location /"
blocks in the nginx configuration:
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
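In context, each of those "location /" blocks would look roughly like
this (a sketch only; the proxy_pass target below is a placeholder for
whatever backend your configuration already points at):

    location / {
        proxy_pass http://localhost:7080;  # placeholder backend address
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }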
After reloading/restarting nginx and Apache, you should start seeing
remote IP addresses in the Apache logs.
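On a systemd-based system that reload/restart would be something like
this (service names vary by distro, e.g. apache2 vs. httpd):

    sudo systemctl reload nginx
    sudo systemctl restart apache2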
Hope that helps!
Jason
On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
Because we're behind a firewall, all the addresses display as 127.0.0.1.
I can talk to the people who administer the firewall, though, about
blocking IPs. Thanks
-Jon
On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
<[email protected]> wrote:
JonGeorg,
Check your Apache logs for the source IP addresses. If you can't find
them, I can share the correct configuration for Apache with Nginx so
that you will get the addresses logged.
Once you know the IP address ranges, block them. If you have a
firewall,
I suggest you block them there. If not, you can block them in Nginx or
in your load balancer configuration if you have one and it allows that.
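For the Nginx option, a minimal sketch of a block rule looks like this
(the CIDR below is a documentation placeholder, not a real bot range;
substitute the ranges you find in your logs):

    # in the server (or location) block:
    deny 203.0.113.0/24;   # placeholder offender range
    allow all;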
You may think you want your catalog to show up in search engines, but
bad bots will lie about who they are. All you can do with misbehaving
bots is to block them.
HtH,
Jason
On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
> Question. We've been getting hammered by search engine bots [?], but
> they seem to all query our system at the same time. Enough that it's
> crashing the app servers. We have a robots.txt file in place. I've
> increased the crawl delay from 3 to 10 seconds, and have explicitly
> disallowed the specific bots, but I've seen no change from the worst
> offenders - Bingbot and UT-Dorkbot. We had over 4k hits from Dorkbot
> alone from 2pm-5pm today, and over 5k from Bingbot in the same
> timeframe. All a couple hours after I made the changes to the robots
> file and restarted apache services. Which, out of 100k entries in the
> vhosts files in that time frame, doesn't sound like a lot, but the rest
> of the traffic looks normal. This issue has been happening
> intermittently [last 3 are 11/30, 11/3, 7/20] for a while, and the only
> thing that seems to work is to manually kill the services on the DB
> servers and restart services on the application servers.
>
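(For reference, a robots.txt along the lines described above might look
like the sketch below. Note that Crawl-delay is a nonstandard directive
that some crawlers ignore, and a bot that lies about its user agent
will ignore all of it:)

    User-agent: UT-Dorkbot
    Disallow: /

    User-agent: bingbot
    Disallow: /

    User-agent: *
    Crawl-delay: 10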
> The symptom is an immediate spike in the Database CPU load. I start
> killing all queries older than 2 minutes, but it still usually
> overwhelms the system, causing the app servers to stop serving requests.
> The stuck queries are almost always ones along the lines of:
>
> -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
> from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
> badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
> check_limit(1000) sort(1) filter_group_entry(1) 1
> site(*/LIBRARY_BRANCH/*) depth(2)
> WITH w AS (
>   WITH */STRING/*_keyword_xq AS (SELECT
>     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
>     to_tsquery('simple', COALESCE(NULLIF( '(' ||
>       btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>       */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
>     (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>       btrim(regexp_replace(split_date_range(search_normalize
> 00:02:17.319491 | */STRING/* |
>
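(Killing queries older than two minutes, as described above, can be
scripted on the PostgreSQL side; a sketch, assuming superuser access
and the two-minute threshold mentioned:)

    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'active'
      AND now() - query_start > interval '2 minutes'
      AND pid <> pg_backend_pid();   -- don't kill our own session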
> And the queries by Dorkbot look like they could be starting the stuck
> query, since they use the basket function in the OPAC.
>
> "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>
> I've anonymized the output just to be cautious. Reports are run off the
> backup database server, so it cannot be an auto-generated report, and it
> doesn't happen often enough for that either. At this point I'm tempted
> to block the IP addresses. What strategies are you all using to deal
> with crawlers, and does anyone have an idea what is causing this?
> -Jon
>
_______________________________________________
Evergreen-general mailing list
[email protected]
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general