On Thu, Aug 27, 2020 at 4:30 PM Christopher Schultz
<ch...@christopherschultz.net> wrote:
>
> David,
>
> On 8/27/20 17:14, David wrote:
> > Thank you all for the replies!
> >
> > On Thu, Aug 27, 2020 at 3:53 PM Christopher Schultz
> > <ch...@christopherschultz.net> wrote:
> >>
> > David,
> >
> > On 8/27/20 13:57, David wrote:
> >>>> On Thu, Aug 27, 2020 at 12:35 PM Christopher Schultz
> >>>> <ch...@christopherschultz.net> wrote:
> >>>>>
> >>>> David,
> >>>>
> >>>> On 8/27/20 10:48, David wrote:
> >>>>>>> In the last two weeks I've had two occurrences where a single
> >>>>>>> CentOS 7 production server hosting a public webpage has become
> >>>>>>> unresponsive. The first time, all 300 available
> >>>>>>> "https-jsse-nio-8443" threads were consumed, with the max age
> >>>>>>> being around 45 minutes, and all in an "S" status. This time
> >>>>>>> all 300 were consumed in "S" status, with the oldest being
> >>>>>>> around 16 minutes. A restart of Tomcat on both occasions freed
> >>>>>>> these threads and the website became responsive again. The
> >>>>>>> connections are POST/GET requests which shouldn't take very
> >>>>>>> long at all.
> >>>>>>>
> >>>>>>> CPU/MEM/JVM all appear to be within normal operating limits.
> >>>>>>> I've not had much luck searching for articles about this
> >>>>>>> behavior, nor finding remedies. The default timeout values are
> >>>>>>> used in both Tomcat and in the applications that run within it,
> >>>>>>> as far as I can tell. Hopefully someone will have some insight
> >>>>>>> into why this behavior could be occurring: why isn't Tomcat
> >>>>>>> killing the connections? Even in a RST/ACK state, shouldn't
> >>>>>>> Tomcat terminate the connection without an ACK from the client
> >>>>>>> after the default timeout?
> >>>>
> >>>> Can you please post:
> >>>>
> >>>> 1. Complete Tomcat version
> >>>>> I can't find anything more granular than 9.0.29, is there a
> >>>>> command to show a sub patch level?
>
> > 9.0.29 is the patch-level, so that's fine. You are about 10 versions
> > out of date (~1 year). Any chance for an upgrade?
>
> >> They had to re-dev many apps last year when we upgraded from, I want
> >> to say, 1 or 3 or something equally horrific. Hopefully they are
> >> forward-compatible with the newer releases, and if not, it should
> >> surely be tackled now rather than later. I will certainly bring this
> >> to the table!
>
> I've rarely been bitten by an upgrade from foo.bar.x to foo.bar.y.
> There is a recent caveat if you are using the AJP connector, but you
> are not, so it's not an issue for you.
>
> >>>> 2. Connector configuration (possibly redacted)
> >>>>> This is the 8443 section of the server.xml. (8080 is available
> >>>>> during the outage, and I'm able to curl the management page to
> >>>>> see the 300 used threads, their status, and age.)
> >>>>>
> >>>>> <Service name="Catalina">
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>   <Connector port="8080" protocol="HTTP/1.1"
> >>>>>              connectionTimeout="20000"
> >>>>>              redirectPort="8443" />
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>   <Connector port="8443"
> >>>>>              protocol="org.apache.coyote.http11.Http11NioProtocol"
> >>>>>              maxThreads="300" SSLEnabled="true" >
> >>>>>     <SSLHostConfig>
> >>>>>       <Certificate
> >>>>>         certificateKeystoreFile="/opt/apache-tomcat-9.0.29/redacted.jks"
> >>>>>         certificateKeystorePassword="redacted" type="RSA" />
> >>>>>     </SSLHostConfig>
> >>>>>   </Connector>
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>   <Connector port="8443"
> >>>>>              protocol="org.apache.coyote.http11.Http11NioProtocol"
> >>>>>              maxThreads="300" SSLEnabled="true" >
> >>>>>     <SSLHostConfig protocols="TLSv1.2">
> >>>>>       <Certificate
> >>>>>         certificateKeystoreFile="/opt/apache-tomcat-9.0.29/redacted.jks"
> >>>>>         certificateKeystorePassword="redacted" type="RSA" />
> >>>>>     </SSLHostConfig>
> >>>>>   </Connector>
>
> > What, two connectors on one port? Do you get errors when starting?
>
> >> No errors. One is "with HTTP2"; should I delete the other one?
>
> Well, one of them will succeed in starting and the other one should
> fail. Did you copy/paste your config without modification? Weird that
> you don't have any errors. Usually you'll get an IOException or
> whatever binding to the port twice.
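(Noting for the archives: a quick way to check whether the second 8443
definition actually failed to bind. The log path below assumes the
/opt/apache-tomcat-9.0.29 install shown in the keystore entries above,
so adjust if the layout differs:

    grep -n 'Address already in use' /opt/apache-tomcat-9.0.29/logs/catalina.out
    ss -ltnp | grep ':8443'

The grep looks for the usual java.net.BindException message in
catalina.out; the ss line shows which process, if any, is actually
listening on 8443.)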
I do recall IOExceptions and "port already in use" errors that caused
Tomcat not to start, but I think those were related to syntax errors when
defining the catalina variables for my JVM sizing. I'll take another look
at catalina.out to make sure I don't still see them, and I'll likely clean
the non-"with HTTP2" connector out of the config regardless. The only
edits to the section of the supplied XML were the .jks store name and
password.

> > I don't see anything obviously problematic in the above configuration
> > (other than the double-definition of the 8443 connector).
> >
> > 300 tied-up connections (from your initial report) sounds like a
> > significant number: probably the thread count.
>
> >> Yes sir, that's the NIO thread count for the 8443 connector.
>
> > Mark (as is often the case) is right: take some thread dumps next
> > time everything locks up and see what all those threads are doing.
> > Often, it's something like everything is waiting on a db connection
> > and the db pool has been exhausted or something. Relatively simple
> > quick-fixes are available for that, and better, longer-term fixes as
> > well.
>
> >> Mark/Chris, is there a way to dump the connector threads
> >> specifically? Or is it all simply contained as a machine/process
> >> thread? Sorry, I'm not really a Linux guy.
>
> Most of the threads in the server will be connector threads. They will
> have names like https-nio-[port]-exec-[number].
>
> If you get a thread dump[1], you'll get a stack trace from every thread.
>
> Rainer wrote a great presentation about them in the context of Tomcat.
> Feel free to give it a read:
> http://home.apache.org/~rjung/presentations/2018-06-13-ApacheRoadShow-JavaThreadDumps.pdf

Awesome!! Thank you for that, I will certainly read it!

> >>>> Do you have a single F5 or a group of them?
> >>>>> A group of them, several HA pairs depending on internal or
> >>>>> external and application. This server is behind one HA pair and
> >>>>> is a single server.
>
> > Okay. Just remember that each F5 can make some large number of
> > connections to Tomcat, so you need to make sure you can handle them.
> >
> > This was a much bigger deal back in the BIO days when thread limit =
> > connection limit, and the thread limit was usually something like
> > 250 - 300. NIO is much better, and the default connection limit is
> > 10k, which "ought to be enough for anyone"[1].
>
> >> (lol)
> >
> >> I'm more used to the 1-to-1 of the BIO style, which kinda confused
> >> me when I asked the F5 to truncate >X connections and alert me, and
> >> there were 600+ connections while the Tomcat manager stated ~30.
> >> Then I read what the non-interrupt was about.
>
> Yeah, NIO allows Tomcat to accept a large number of connections and
> have a small number of threads process the work they represent. It's
> not totally swarm-style processing because (a) the servlet spec makes
> some guarantees about which thread processes your request and (b) Java
> doesn't really have the ability to pause execution in one thread and
> let another thread take it over.
>
> If you really want totally asynchronous processing, your application
> must opt into it using a special API. So if you have a bog-standard
> read, process, write style application, then 300 simultaneous requests
> will be all you can handle. (Unless you raise that limit, of course.)
>
> You said something interesting earlier and I want to make sure I
> understood you correctly.
> You said that the application locked up but you were able to use curl
> to observe something. Can you be really specific about that? Most
> requests come through port 8443. Which port did you connect to in order
> to call curl? If it's 8443 then that's suspicious. If it's 8080 then it
> makes more sense, as there will be a different thread-pool used for
> each of those connectors.

That is correct: I used HTTP to 8080 in order to read the Tomcat web
manager stats. I originally had issues with the JVM being too small,
running out of memory, CPU spiking, threads maxing out, and whole-system
instability. Getting more machine memory and upping the JVM allocation
has remedied all of that except, apparently, the thread issue. I'm unsure
whether the threads were aging back then, as I couldn't get into
anything, but with no room for GC to take place it would make sense that
they would not be released.

My intention was to restart Tomcat nightly to lessen the chance of an
occurrence until I could find a way to restart Tomcat based on the thread
count and script a thread dump at the same time (likely through
SolarWinds). Now that you've explained that the NIO threads are part of
the system threads, I may be able to script something like that directly
on the system: a crontab job to check the count, and if more than 295 NIO
threads are present, dump the threads to a file and systemctl stop/start
Tomcat (a rough sketch of what I have in mind is at the bottom of this
mail). That's very encouraging, as it seems a viable way to get the data
I need without much impact to users.

Your explanation of threads leads me to believe that the nightly restart
may be rather moot, as it could well be exhaustion downstream causing the
backup on the front end. I didn't see these as connected in this way and
assumed they were asynchronous, independent processes. There are timeouts
configured for all the DB2 backend connections, and I was of the mindset
that the smallest timeout would kill all connections upstream/downstream
by presenting the application with a "forcibly closed by remote host" or
a timeout.

I greatly appreciate the assistance. In looking through various articles,
none of this was really discussed, either because everyone already knows
it or because it was discussed at a level I couldn't follow; there
certainly don't seem to be other reports of connections staying open for
18-45 minutes, or if there are, it's not an issue for those people.
During a normal glance at the manager page there are no connections,
maybe five empty lines in a "Ready" stage, and even if I spam the
server's logon landing page I can never see a persistent connection. So
it baffled me how connections could hang and build up, and I'm thinking
something was perhaps messed up with the backend. The webapp names/URLs
for the oldest connections didn't coincide between the two outages, so I
kind of brushed it off as being application-specific, however it may
still be. I need it to occur again and get some dumps!

> [1]
> https://cwiki.apache.org/confluence/display/TOMCAT/HowTo#HowTo-HowdoIobtainathreaddumpofmyrunningwebapp?
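To be concrete about the 8080 check mentioned above: the busy/ready
numbers I'm reading come from the manager status page, along these lines
(the credentials are placeholders, and the user needs the manager-status
or manager-gui role; the stock manager webapp is assumed to be deployed):

    curl -s -u admin:redacted 'http://localhost:8080/manager/status?XML=true'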
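And here is the rough sketch I mentioned for the crontab check. Everything
in it is an assumption rather than something I've tested: the systemd
service name ("tomcat"), the install path, the 295 threshold, and that the
JDK's jstack is available to the user Tomcat runs as.

    #!/bin/sh
    # Sketch only: save a thread dump and bounce Tomcat when the 8443
    # connector pool looks saturated.
    CATALINA_BASE=/opt/apache-tomcat-9.0.29          # assumed install path
    PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)
    [ -n "$PID" ] || exit 0

    # Counting pool threads is only a rough proxy for "all 300 are busy",
    # but an idle pool shrinks over time, so a full pool is a usable trigger.
    COUNT=$(jstack "$PID" | grep -c '"https-jsse-nio-8443-exec-')

    if [ "$COUNT" -gt 295 ]; then
        # Capture the evidence first, then restart the service.
        jstack "$PID" > "$CATALINA_BASE/logs/threaddump-$(date +%Y%m%d-%H%M%S).txt"
        systemctl restart tomcat
    fi

If jstack isn't available, "kill -3 $PID" would write the thread dump into
catalina.out instead.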
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org