Yeah, I'm not seeing any /opac/extras/unapi requests in the Apache logs. Is DorkBot used legitimately for querying the opac? -Jon
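For spotting which URLs and clients are hammering the OPAC, the log-counting pipeline Blake suggests further down the thread can be tried end-to-end. The log sample below is fabricated for illustration (hostnames, IPs, and paths are placeholders, not from this thread):

```shell
# Fabricated other_vhosts_access.log sample (illustrative hosts/IPs only);
# in this log format, field 2 is the client IP.
cat > /tmp/sample_access.log <<'EOF'
example.org:443 203.0.113.5 - - [02/Dec/2021:14:01:02 -0800] "GET /eg/opac/results?query=1 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
example.org:443 203.0.113.5 - - [02/Dec/2021:14:01:03 -0800] "GET /eg/opac/results?query=2 HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
example.org:443 198.51.100.7 - - [02/Dec/2021:14:01:04 -0800] "GET /eg/opac/record/42 HTTP/1.0" 200 5678 "-" "Mozilla/5.0"
EOF

# Requests per client IP, busiest first:
awk '{print $2}' /tmp/sample_access.log | sort | uniq -c | sort -rn

# Requests per user agent, busiest first (the user agent is the
# last quoted field in the combined log format):
awk -F'"' '{print $(NF-1)}' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

The same two pipelines work unchanged against the real `/var/log/apache2/other_vhosts_access.log`.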
On Fri, Dec 3, 2021 at 10:37 AM JonGeorg SageLibrary <[email protected]> wrote:
> Thank you!
> -Jon
>
> On Fri, Dec 3, 2021 at 8:10 AM Blake Henderson via Evergreen-general <[email protected]> wrote:
>
>> JonGeorg,
>>
>> This reminds me of a similar issue that we had. We resolved it with this
>> change to NGINX. Here's the link:
>>
>> https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/blake/LP1913610_nginx_request_limits
>>
>> and the bug:
>> https://bugs.launchpad.net/evergreen/+bug/1913610
>>
>> I'm not sure that it's the same issue, though, as you've shared a search
>> SQL query and this solution addresses external requests to
>> "/opac/extras/unapi". But you might be able to apply the same nginx rate
>> limiting technique here if you can detect the URL they are using.
>>
>> There is a tool called "apachetop" which I used to see the URLs that
>> were being requested:
>>
>> apt-get -y install apachetop && apachetop -f /var/log/apache2/other_vhosts_access.log
>>
>> and another useful command:
>>
>> cat /var/log/apache2/other_vhosts_access.log | awk '{print $2}' | sort | uniq -c | sort -rn
>>
>> You have to ignore (not limit) all the requests to the Evergreen
>> gateway, as most of that traffic is the staff client and should
>> (probably) not be limited.
>>
>> I'm just throwing some ideas out there for you. Good luck!
>>
>> -Blake-
>> Conducting Magic
>> Can consume data in any format
>> MOBIUS
>>
>> On 12/2/2021 9:07 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>
>> I tried that and still got the loopback address, after restarting
>> services. Any other ideas? And the robots.txt file seems to be doing
>> nothing, which is not much of a surprise. I've reached out to the people
>> who host our network and have control of everything on the other side of
>> the firewall.
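The nginx rate limiting Blake refers to could look roughly like the fragment below. This is a sketch only: the exact zone names, rates, and paths in the linked LP1913610 branch may differ, and the backend port is assumed from a stock Evergreen nginx-proxy setup:

```nginx
# http {} context: one state bucket per client IP, ~5 req/s sustained.
limit_req_zone $binary_remote_addr zone=unapi_zone:10m rate=5r/s;

# server {} context: apply the limit only to the crawler-heavy URL,
# leaving the Evergreen gateway (mostly staff client traffic) untouched.
location /opac/extras/unapi {
    limit_req zone=unapi_zone burst=10 nodelay;
    proxy_pass http://localhost:7080;   # assumed Apache backend port
}
```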
>> -Jon
>>
>> On Wed, Dec 1, 2021 at 3:57 AM Jason Stephenson <[email protected]> wrote:
>>
>>> JonGeorg,
>>>
>>> If you're using nginx as a proxy, that may be the configuration of
>>> Apache and nginx.
>>>
>>> First, make sure that mod_remoteip is installed and enabled for Apache 2.
>>>
>>> Then, in eg_vhost.conf, find the 3 lines that begin with
>>> "RemoteIPInternalProxy 127.0.0.1/24" and uncomment them.
>>>
>>> Next, see what header Apache checks for the remote IP address. In my
>>> example it is "RemoteIPHeader X-Forwarded-For".
>>>
>>> Next, make sure that the following two lines appear in BOTH "location /"
>>> blocks in the nginx configuration:
>>>
>>> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>>> proxy_set_header X-Forwarded-Proto $scheme;
>>>
>>> After reloading/restarting nginx and Apache, you should start seeing
>>> remote IP addresses in the Apache logs.
>>>
>>> Hope that helps!
>>> Jason
>>>
>>> On 12/1/21 12:53 AM, JonGeorg SageLibrary wrote:
>>> > Because we're behind a firewall, all the addresses display as 127.0.0.1.
>>> > I can talk to the people who administer the firewall though about
>>> > blocking IPs. Thanks
>>> > -Jon
>>> >
>>> > On Tue, Nov 30, 2021 at 8:20 PM Jason Stephenson via Evergreen-general
>>> > <[email protected]> wrote:
>>> >
>>> > JonGeorg,
>>> >
>>> > Check your Apache logs for the source IP addresses. If you can't find
>>> > them, I can share the correct configuration for Apache with nginx so
>>> > that you will get the addresses logged.
>>> >
>>> > Once you know the IP address ranges, block them. If you have a
>>> > firewall, I suggest you block them there. If not, you can block them
>>> > in nginx or in your load balancer configuration if you have one and
>>> > it allows that.
>>> >
>>> > You may think you want your catalog to show up in search engines, but
>>> > bad bots will lie about who they are.
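Jason's steps above, collected into the two config fragments involved (a sketch; the header name and file locations are the stock ones his message assumes):

```
# Apache side (eg_vhost.conf), with mod_remoteip enabled
# (a2enmod remoteip): trust the local proxy and take the real
# client address from the X-Forwarded-For header.
RemoteIPInternalProxy 127.0.0.1/24
RemoteIPHeader X-Forwarded-For

# nginx side: in BOTH "location /" blocks (HTTP and HTTPS servers),
# pass the real client address through to Apache.
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
```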
All you can do with
>>> > misbehaving bots is to block them.
>>> >
>>> > HtH,
>>> > Jason
>>> >
>>> > On 11/30/21 9:34 PM, JonGeorg SageLibrary via Evergreen-general wrote:
>>> > > Question. We've been getting hammered by search engine bots [?],
>>> > > but they seem to all query our system at the same time, enough that
>>> > > it's crashing the app servers. We have a robots.txt file in place.
>>> > > I've increased the crawl delay from 3 to 10 seconds and have
>>> > > explicitly disallowed the specific bots, but I've seen no change
>>> > > from the worst offenders: Bingbot and UT-Dorkbot. We had over 4k
>>> > > hits from Dorkbot alone from 2pm-5pm today, and over 5k from
>>> > > Bingbot in the same timeframe, all a couple of hours after I made
>>> > > the changes to the robots file and restarted Apache services. Out
>>> > > of 100k entries in the vhosts logs in that time frame, that doesn't
>>> > > sound like a lot, but the rest of the traffic looks normal. This
>>> > > issue has been happening intermittently [the last 3 occurrences
>>> > > were 11/30, 11/3, and 7/20], and the only thing that seems to work
>>> > > is to manually kill the services on the DB servers and restart
>>> > > services on the application servers.
>>> > >
>>> > > The symptom is an immediate spike in the database CPU load. I start
>>> > > killing all queries older than 2 minutes, but it still usually
>>> > > overwhelms the system, causing the app servers to stop serving
>>> > > requests.
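For reference, the robots.txt changes described above would look something like the fragment below. Note that Crawl-delay is not part of the robots.txt standard and support varies by crawler, which is consistent with the lack of effect reported; a bot like UT-Dorkbot may ignore robots.txt entirely:

```
User-agent: bingbot
Crawl-delay: 10

User-agent: UT-Dorkbot
Disallow: /
```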
>>> > > The stuck queries are almost always ones along the lines of:
>>> > >
>>> > > -- bib search: #CD_documentLength #CD_meanHarmonic #CD_uniqueWords
>>> > > from_metarecord(*/BIB_RECORD#/*) core_limit(100000)
>>> > > badge_orgs(1,138,151) estimation_strategy(inclusion) skip_check(0)
>>> > > check_limit(1000) sort(1) filter_group_entry(1) 1
>>> > > site(*/LIBRARY_BRANCH/*) depth(2)
>>> > > | | WITH w AS (
>>> > > | | WITH */STRING/*_keyword_xq AS (SELECT
>>> > > | | (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>>> > > btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>>> > > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) ||
>>> > > to_tsquery('simple', COALESCE(NULLIF( '(' ||
>>> > > btrim(regexp_replace(split_date_range(search_normalize(replace(replace(uppercase(translate_isbn1013(E'1')),
>>> > > */LONG_STRING/*))),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), ''))) AS tsq,
>>> > > | | (to_tsquery('english_nostop', COALESCE(NULLIF( '(' ||
>>> > > btrim(regexp_replace(split_date_range(search_normalize
>>> > > 00:02:17.319491 | */STRING/* |
>>> > >
>>> > > And the queries by DorkBot look like they could be starting the
>>> > > query, since it's using the basket function in the OPAC:
>>> > >
>>> > > "GET /eg/opac/results?do_basket_action=Go&query=1&detail_record_view=*/LONG_STRING/*&search-submit-go=Search&no_highlight=1&modifier=metabib&select_basket_action=1&qtype=keyword&fg%3Amat_format=1&locg=112&sort=1
>>> > > HTTP/1.0" 500 16796 "-" "UT-Dorkbot/1.0"
>>> > >
>>> > > I've anonymized the output just to be cautious. Reports are run off
>>> > > the backup database server, so it cannot be an auto-generated
>>> > > report, and it doesn't happen often enough for that either.
>>> > > At this point I'm tempted
>>> > > to block the IP addresses. What strategies are you all using to
>>> > > deal with crawlers, and does anyone have an idea what is causing
>>> > > this?
>>> > > -Jon
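If blocking does turn out to be the answer, it can be done at nginx or at the host firewall, as Jason suggests. A sketch of both, with placeholder addresses (203.0.113.0/24 is a documentation range, not a real offender):

```
# nginx, server {} context: refuse an offending range and user agent outright.
deny 203.0.113.0/24;                    # placeholder CIDR

if ($http_user_agent ~* "UT-Dorkbot") {
    return 403;
}

# Or at the host firewall instead:
#   iptables -I INPUT -s 203.0.113.0/24 -j DROP
```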
_______________________________________________
Evergreen-general mailing list
[email protected]
http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-general
