On 26/10/2020 10:26, Mark Thomas wrote:
> On 24/10/2020 01:32, Eric Robinson wrote:
>
> <snip/>
>
>>>> -----Original Message-----
>>>> From: Mark Thomas <ma...@apache.org>
>
> <snip/>
>
>>>> The failed request:
>>>> - Completes in ~6ms
>>>
>>> I think we've seen the failed requests take as much as 50ms.
>
> Ack. That is still orders of magnitude smaller than the timeout and
> consistent with the generation time of some of the larger responses.
>
> I wouldn't say it confirms any of my previous conclusions but it doesn't
> invalidate them either.
>
>>>> Follow-up questions:
>>>> - JVM
>>>>   - Vendor?
>>>>   - OS package or direct from Vendor?
>>>>
>>>
>>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
>
> OK. That is post Java 8u202 so it should be a paid-for, commercially
> supported version of Java 8.
>
> The latest Java 8 release from Oracle is 8u271.
>
> The latest Java 8 release from AdoptOpenJDK is 8u272.
>
> I don't think we are quite at this point yet, but what is your view on
> updating to the latest Java 8 JDK (from either Oracle or AdoptOpenJDK)?
>
>>>> - Tomcat
>>>>   - OS package, 3rd-party package or direct from ASF?
>>>>
>>>
>>> tomcat.noarch 7.0.76-6.el7 from CentOS base repository
>>>
>>
>> Drat, slight correction. I now recall that although we initially installed
>> 7.0.76 from the CentOS repo, the application vendor made us lower the
>> version to 7.0.72, and I DO NOT know where we got that. However, it has not
>> changed since October-ish of 2018.
>
> I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
> relevant changes.
>
>>>> - Config
>>>>   - Any changes at all around the time the problems started? I'm
>>>>     thinking OS updates, VM restarted etc?
>>>>
>>>
>>> server.xml has not changed since 4/20/20, which was well before the
>>> problem manifested, and all the other files in the conf folder are even
>>> older than that. We're seeing this symptom on both production servers.
>>> One of them was rebooted a week ago, but the other has been up
>>> continuously for 258 days.
>
> OK. That rules a few things out, which is good, but it does make the
> trigger for this issue even more mysterious.
>
> Any changes in the Nginx configuration in the relevant timescale?
>
> Any updates to the application in the relevant timescale?
>
> Any features users started using that hadn't been used before in that
> timescale?
>
> <snip/>
>
>>>> Recommendations:
>>>> - Switch back to the BIO connector if you haven't already. It has fewer
>>>>   moving parts than NIO so it is simpler to debug.
>>>> - Add "%b" to the access log pattern for Tomcat's access log valve to
>>>>   record the number of body (excluding headers) bytes Tomcat believes it
>>>>   has written to the response.
>>>>
>>>>
>>>> Next steps:
>>>> - Wait for the issue to re-occur after the recommended changes above and,
>>>>   depending on what is recorded in the access log for %b for a failed
>>>>   request, shift the focus accordingly.
>>>> - Answers to the additional questions would be nice but the access log
>>>>   %b value for a failed request is the key piece of information required
>>>>   at this point.
>>>>
>>>
>>> Good news! I enabled that parameter a few days ago and we have already
>>> caught some instances of the problem occurring.
>
> Excellent!
>
>>> Here is the logging format...
>>>
>>> <Valve className="org.apache.catalina.valves.AccessLogValve"
>>>        directory="logs" prefix="localhost_access."
suffix=".log" pattern="%h %l %D >>> %u %t >>> %{JSESSIONID}c %{cookie}i %r %s %b %S %q" /> >>> >>> Due to some sensitive content in the HTTP requests below, I have globally >>> replaced certain words and addresses with random-ish strings, but I don't >>> think I've done anything to impact the issue. >>> >>> Following is an example from Wednesday. >>> >>> This is a request being sent from the nginx proxy to the first of 2 upstream >>> servers, 10.51.14.46 >>> >>> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream prematurely >>> closed connection while reading response header from upstream, client: >>> 99.88.77.66, server: redacted.domain.com, request: "GET >>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10- >>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID >>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0. >>> 5650846×tamp=21102020155122.472656 HTTP/1.1", upstream: >>> "http://10.51.14.46:3016/sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=20 >>> 20-10- >>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID >>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0. >>> 5650846×tamp=21102020155122.472656", host: >>> "redacted.domain.com" >>> >>> Here is the matching localhost_access log entry from that server…. >>> >>> 10.51.14.133 - 144 - [21/Oct/2020:15:51:22 -0400] >>> F405E25E49E3DCB81A36A87DED1FE573 >>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573; >>> srv_id=dea8d61a7d725e980a6093cb78d8ec73; >>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573; >>> srv_id=dea8d61a7d725e980a6093cb78d8ec73 GET >>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10- >>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID >>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0. >>> 5650846×tamp=21102020155122.472656 HTTP/1.0 200 40423 >>> F405E25E49E3DCB81A36A87DED1FE573 ?eDate=2020-10- >>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID >>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0. >>> 5650846×tamp=21102020155122.472656 >>> >>> Tomcat appears to think it sent 40423 bytes. However, even though it shows >>> an HTTP 200 response, WireShark shows the response did not actually go out >>> on the wire. > > That is very interesting. That narrows down where things might be going > wrong. > > It confirms that the application is generating the response so we can > rule out a bug in that process. > > By default there is some buffering here. The buffer size is 9000 bytes. > So with a response of 40423 I'd expect at least some of those bytes to > be written to network. However, Wireshark is reporting no bytes written. > > > At the moment, I see the following options: > > 1. Tomcat bug causing the writes to made to the wrong response. > > 2. Application bug (retaining references to request/response/writer > etc.) causing the writes to made to the wrong response. > > 3. JVM bug causing network writes to be dropped. > > 4. OS bug causing network writes to be dropped before they reach the > point where they are observable by Wireshark. > > To be frank, none of those options seem particularly likely. 
>
> At this point I'd echo Konstantin's recommendation to add the following
> system property:
> org.apache.catalina.connector.RECYCLE_FACADES=true
>
> You'd normally do this in $CATALINA_HOME/bin/setenv.sh (creating that
> file if necessary) with a line like:
>
> CATALINA_OPTS="$CATALINA_OPTS
> -Dorg.apache.catalina.connector.RECYCLE_FACADES=true"
>
> You can confirm that the setting has been applied by looking in the log
> for the start-up. You should see something like:
>
> Oct 26, 2020 10:18:45 AM
> org.apache.catalina.startup.VersionLoggerListener log
> INFO: Command line argument:
> -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
>
>
> That option reduces the re-use of request, response and related objects
> between requests and, if an application is retaining references it
> shouldn't, you usually see a bunch of NullPointerExceptions in the logs
> when the application tries to re-use those objects.
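
If RECYCLE_FACADES does flush out a retained reference along the lines
of the sketch above, the usual shape of the fix is to stop
container-managed objects escaping the request. Where work genuinely
has to continue after the servlet method returns, the Servlet 3.0 async
API (supported by Tomcat 7) keeps the request and response valid until
the application calls complete(). Again, this is only an illustrative
sketch with invented names, not a drop-in change for your application:

// Hypothetical sketch of the safe pattern for deferred writes: hand the
// request to the container via startAsync() instead of caching the
// writer yourself. The container keeps the request/response pair valid
// (and unrecycled) until complete() is called.
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import javax.servlet.AsyncContext;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/deferred", asyncSupported = true)
public class DeferredWriteServlet extends HttpServlet {

    private static final ScheduledExecutorService EXEC =
            Executors.newSingleThreadScheduledExecutor();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        AsyncContext ctx = req.startAsync();
        EXEC.schedule(() -> {
            try {
                ctx.getResponse().getWriter().println("deferred but safe");
            } catch (IOException e) {
                getServletContext().log("Deferred write failed", e);
            } finally {
                ctx.complete();   // releases the request/response back to Tomcat
            }
        }, 5, TimeUnit.SECONDS);
    }
}
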
> Meanwhile, I'm going to work on a custom patch for 7.0.72 to add some
> additional logging around the network writes.

Patch files and instructions for use:
http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v1/

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org