Mladen,

On 12/16/20 10:12, Mladen Adamović wrote:
On Wed, Dec 16, 2020 at 3:27 PM Christopher Schultz <ch...@christopherschultz.net> wrote:

     > We have a self-monitoring script which runs on the server, and when the server
     > is not working properly it saves the logs and restarts the service.

    How do you detect this state? Just make a request and if you get "No
    data received" from curl, you restart the server?


If there is an error, or the expected text doesn't appear in the response, we save the server's state and restart Tomcat via /etc/init.d/tomcat.
The full script is:
#!/bin/bash
serverFailure=0
cd /root
rm /root/numbeo_test.out
#wget -t 1 -T 5 --no-proxy --no-cache --cache=off -q 'localhost:8080/cost-of-living/city_result.jsp?country=Ireland&city=Dublin' -O /root/numbeo_test.out
#curl -L -m 2 -v  -o /root/numbeo_test.out --trace curl.log 'localhost:8008/cost-of-living/in/Dublin'
curl -L -m 2 -v --insecure -o /root/numbeo_test.out --trace curl.log 'https://localhost:8181/cost-of-living/in/Dublin'
wgetOutput=$?

grep -q "entries in the past" /root/numbeo_test.out
if [ $? != 0 ]; then
cd /root
rm /root/numbeo_test.out
sleep 10s
#wget -t 2 -T 2 --no-proxy --no-cache --cache=off -q 'localhost:8080/cost-of-living/city_result.jsp?country=Ireland&city=Dublin' -O /root/numbeo_test.out
#curl -L -m 2 --retry 1 -v  -o /root/numbeo_test.out --trace curl.log 'localhost:8008/cost-of-living/in/Dublin'
curl -L -m 2 -v --insecure -o /root/numbeo_test.out --trace curl.log 'https://localhost:8181/cost-of-living/in/Dublin'
wgetOutput=$?
grep -q "entries in the past" /root/numbeo_test.out

if [ $? != 0 ]; then
#echo 'server is down!';
ps -eo pid,comm | while read pid command
do
    if [[ "$command" = "java" ]]
        then
                echo $pid
                DATE=`date +%Y-%m-%d`
                echo ${wgetOutput} > ~/wget_${DATE}_${pid}.log
                cp /root/numbeo_test.out ~/numbeo_test_out_${DATE}_${pid}.log
                jstack -J-d64 -F $pid > ~/jstack_${DATE}_${pid}.log
                iostat > ~/iostat_${DATE}_${pid}.log
                vmstat > ~/vmstat_${DATE}_${pid}.log
                netstat -tnp > ~/netstat_${DATE}_${pid}.log
                netstat -anp | grep 'tcp\|udp' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n > ~/netstat_anp_outline_${DATE}_${pid}.log
                ps aux > ~/ps_aux_${DATE}_${pid}.log
                tail -n 5000 ~glassfish/apache-tomcat-8.5.5/logs/catalina.out > ~/catalina_out_${DATE}_${pid}.log
                break
    fi
done
echo 'too many server failures... going to reboot softly' >> ~/reboot.log ;
date | mail -s "Numbeo soft reset" mladen.adamo...@gmail.com
date >> ~/reboot.log
killall -9 java
/root/fix_letsencrypt_chmod.sh
#/etc/init.d/glassfish start
/etc/init.d/tomcat start
#reboot
fi
fi

That seems a little fragile, but it's your server so I guess you can do what you want.
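For comparison, a minimal sketch of a less fragile probe, assuming the same URL and marker text ("entries in the past") as the script above. curl's -f option turns an HTTP error status into a non-zero exit code instead of a saved error page, -m caps the whole request, and the check only reports failure after several consecutive failed attempts. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch of a less fragile health probe. Function names are hypothetical;
# the URL and marker text come from the monitoring script above.

# Succeed only if FILE exists, is non-empty and contains MARKER (literal text).
page_is_healthy() {
    local file=$1 marker=$2
    [[ -s "$file" ]] && grep -qF -- "$marker" "$file"
}

# Fetch URL once into OUTFILE, then look for MARKER in it.
#   -f  exit non-zero on an HTTP status >= 400 instead of saving the error page
#   -sS no progress meter, but still report errors on stderr
#   -k  same as --insecure in the original script
#   -m  cap on the whole transfer; 2 seconds, as in the original script
probe() {
    local url=$1 marker=$2 out=$3
    rm -f -- "$out"
    curl -fsS -k -L -m 2 -o "$out" "$url" || return 1
    page_is_healthy "$out" "$marker"
}

# Succeed as soon as one of TRIES probes passes; wait DELAY seconds between tries.
probe_with_retries() {
    local url=$1 marker=$2 out=$3 tries=$4 delay=$5 i
    for ((i = 1; i <= tries; i++)); do
        if probe "$url" "$marker" "$out"; then
            return 0
        fi
        if ((i < tries)); then
            sleep "$delay"
        fi
    done
    return 1
}
```

In the monitoring script, a call such as probe_with_retries 'https://localhost:8181/cost-of-living/in/Dublin' 'entries in the past' /root/numbeo_test.out 2 10 could replace the two copy-pasted curl/grep blocks, with the diagnostics and the restart running only when it returns non-zero.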

    I see you are using Let's Encrypt. How are you managing the rotating of
    the keys and certificates?


Crontab: 5   1  1   *   *     /root/renew_cert_numbeo.sh
root@condor1796 ~ # cat renew_cert_numbeo.sh
#!/bin/bash

mkdir -p /tmp/letsencrypt/public_html
certbot certonly -n --force-renewal --webroot --webroot-path /tmp/letsencrypt/public_html -d numbeo.com -d www.numbeo.com \
        -d es.numbeo.com -d pt.numbeo.com -d fr.numbeo.com -d ru.numbeo.com -d ja.numbeo.com -d de.numbeo.com -d nl.numbeo.com \
        -d it.numbeo.com -d zh.numbeo.com -d ar.numbeo.com -d jobs.numbeo.com \
        --agree-tos --email mladen.adamo...@gmail.com

/root/fix_letsencrypt_chmod.sh
if [ $? != 0 ]; then
    date | mail -s "Lets encrypt renew certificate fails for numbeo.com" mladen.adamo...@gmail.com
else
    /etc/init.d/tomcat restart
fi

root@condor1796 ~ # cat fix_letsencrypt_chmod.sh
#!/bin/bash
chmod o+rx /etc/letsencrypt
chmod -R o+rx /etc/letsencrypt/*

root@condor1796 ~ #

I think your scripts will restart Tomcat even when it's not necessary. The $? check before sending the email message looks like it should be checking the result of the certbot command, but it's checking the result of the chmod command instead. (Or rather, the result of the .sh script, which will probably be 0.) So if certbot fails, no email is sent and Tomcat is restarted anyway.
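A minimal sketch of renew_cert_numbeo.sh with certbot's exit status captured immediately, so that a failed renewal sends the email and leaves Tomcat alone. The domain list is shortened, NOTIFY is a placeholder address, and the FIX_CHMOD, TOMCAT_INIT and WEBROOT variables (defaulting to the paths above) are hypothetical, only there so the commands can be swapped out when testing.

```shell
#!/bin/bash
# Sketch of renew_cert_numbeo.sh that checks certbot's own exit status.
# Assumptions: NOTIFY is a placeholder address, the domain list is shortened,
# and the default paths are the ones used by the original scripts.
NOTIFY="${NOTIFY:-admin@example.com}"                     # hypothetical placeholder
FIX_CHMOD="${FIX_CHMOD:-/root/fix_letsencrypt_chmod.sh}"
TOMCAT_INIT="${TOMCAT_INIT:-/etc/init.d/tomcat}"
WEBROOT="${WEBROOT:-/tmp/letsencrypt/public_html}"

renew_and_restart() {
    mkdir -p "$WEBROOT"

    certbot certonly -n --force-renewal --webroot \
        --webroot-path "$WEBROOT" \
        -d numbeo.com -d www.numbeo.com -d jobs.numbeo.com \
        --agree-tos --email "$NOTIFY"
    local certbot_status=$?   # capture at once: the next command overwrites $?

    if [ "$certbot_status" -ne 0 ]; then
        date | mail -s "Lets encrypt renew certificate fails for numbeo.com" "$NOTIFY"
        return "$certbot_status"   # keep Tomcat running with the old certificate
    fi

    # Design choice: skip the restart if the permissions fix fails, since
    # Tomcat might then be unable to read the new key files.
    "$FIX_CHMOD" || return 1
    "$TOMCAT_INIT" restart
}
```

The script itself would end with a plain renew_and_restart call; the crontab entry stays the same.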

     > *What would be the next steps to identify the problem and perhaps
     > solve it?*
    What have you done so far?


Aaah... reading the Tomcat source to try to understand the state of the threads.

    I don't see anything that sticks out in your thread dump.


Several threads are trying to acquire the monitor in AprEndpoint$Poller.add, and no thread seems to be holding it. Don't you find that weird?

root@condor1796 ~ # grep Poller jstack_2020-12-16_31415.log  | grep "Apr"
 - org.apache.tomcat.util.net.AprEndpoint$Poller.add(long, long, int) @bci=102, line=1398 (Compiled frame)

I might have found that odd had you posted that in your original message, but you did not.

You need to show the full stack trace for that thread to make it meaningful. Sockets are added to the poller all the time. It's not unusual to see that happening. If they are getting *stuck*, that would be bad, of course.
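To pull the full stack of every thread that is inside Poller.add out of a dump, something like the following sketch should work. It relies on thread entries being separated by blank lines, which appears to hold for both the regular jstack format and the -F format quoted above; the function names are hypothetical and the pattern is matched as a literal string.

```shell
#!/bin/bash
# Print every complete thread entry in a jstack dump whose text contains
# PATTERN (matched as a literal string, not a regex). awk's paragraph mode
# (RS = "") treats each blank-line-separated thread entry as one record.
# Function names are hypothetical.
threads_containing() {
    local pattern=$1 dump=$2
    awk -v pat="$pattern" '
        BEGIN { RS = ""; ORS = "\n\n" }
        index($0, pat) > 0
    ' "$dump"
}

# Count how many thread entries contain PATTERN.
count_threads_containing() {
    local pattern=$1 dump=$2
    awk -v pat="$pattern" '
        BEGIN { RS = "" }
        index($0, pat) > 0 { n++ }
        END { print n + 0 }
    ' "$dump"
}
```

For example: threads_containing 'AprEndpoint$Poller.add' jstack_2020-12-16_31415.log (single quotes keep the shell from expanding $Poller).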

Don't you find it weird that all the threads are trying to synchronize on a Poller instance, and no thread is inside this block or any other synchronized block or method?

I would find it weird if no threads were making any progress. Lots of threads adding sockets to the poller is not out of the ordinary.
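If the question is whether some thread owns the Poller monitor, a regular HotSpot dump (plain jstack <pid>, without the -F used in the script above) prints "- waiting to lock <0x...>" and "- locked <0x...>" lines that can be cross-checked. A rough sketch with a hypothetical function name: it lists each monitor address that threads are waiting for but that no thread in the same dump reports as locked, together with the number of waiters. A non-empty result is a hint worth a closer look, not proof of a bug, and comparing two or three dumps taken a few seconds apart shows whether the waiting threads are making progress.

```shell
#!/bin/bash
# Sketch: list monitors that some thread is "waiting to lock" but that no
# thread in the same dump reports as "locked", with the number of waiters.
# Assumes a regular HotSpot thread dump (jstack <pid> without -F), which
# prints lines such as
#   - waiting to lock <0x...> (a org.apache.tomcat.util.net.AprEndpoint$Poller)
#   - locked <0x...> (a org.apache.tomcat.util.net.AprEndpoint$Poller)
# Function name is hypothetical.
unowned_contended_monitors() {
    local dump=$1
    awk '
        # Return the address between < and > on the current line.
        function addr(   s) {
            s = $0
            sub(/.*</, "", s)
            sub(/>.*/, "", s)
            return s
        }
        /- waiting to lock </ { waiting[addr()]++ }
        /- locked </          { locked[addr()] = 1 }
        END {
            for (a in waiting)
                if (!(a in locked))
                    printf "%s %d\n", a, waiting[a]
        }
    ' "$dump"
}
```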

If you suspect a bug in Tomcat's socket handling, upgrading to the latest 8.5.x release and re-trying would be the best move. There have been many fixes since your 8.5.5 release, which is now more than four years old.

-chris

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
