On 24/10/2020 01:32, Eric Robinson wrote:

<snip/>
>>> -----Original Message-----
>>> From: Mark Thomas <ma...@apache.org>

<snip/>

>>> The failed request:
>>> - Completes in ~6ms
>>
>> I think we've seen the failed requests take as much as 50ms.

Ack. That is still orders of magnitude smaller than the timeout and
consistent with the generation time of some of the larger responses. I
wouldn't say it confirms any of my previous conclusions but it doesn't
invalidate them either.

>>> Follow-up questions:
>>> - JVM
>>>   - Vendor?
>>>   - OS package or direct from Vendor?
>>>
>>
>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.

OK. That is post Java 8u202 so it should be a paid-for, commercially
supported version of Java 8.

The latest Java 8 release from Oracle is 8u271. The latest Java 8
release from AdoptOpenJDK is 8u272.

I don't think we are quite at this point yet, but what is your view on
updating to the latest Java 8 JDK (from either Oracle or AdoptOpenJDK)?

>>> - Tomcat
>>>   - OS package, 3rd-party package or direct from ASF?
>>>
>>
>> tomcat.noarch 7.0.76-6.el7 from CentOS base repository
>>
>
> Drat, slight correction. I now recall that although we initially
> installed 7.0.76 from the CentOS repo, the application vendor made us
> lower the version to 7.0.72, and I DO NOT know where we got that.
> However, it has not changed since October-ish of 2018.

I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
relevant changes.

>>> - Config
>>>   - Any changes at all around the time the problems started? I'm
>>>     thinking OS updates, VM restarted etc.?
>>>
>>
>> server.xml has not changed since 4/20/20, which was well before the
>> problem manifested, and all the other files in the conf folder are
>> even older than that. We're seeing this symptom on both production
>> servers. One of them was rebooted a week ago, but the other has been
>> up continuously for 258 days.

OK. That rules a few things out, which is good, but it does make the
trigger for this issue even more mysterious.

Any changes in the Nginx configuration in the relevant timescale?

Any updates to the application in the relevant timescale?

Any features users started using that hadn't been used before in that
timescale?

<snip/>

>>> Recommendations:
>>> - Switch back to the BIO connector if you haven't already. It has
>>>   fewer moving parts than NIO so it is simpler to debug.
>>> - Add "%b" to the access log pattern for Tomcat's access log valve
>>>   to record the number of body (excluding headers) bytes Tomcat
>>>   believes it has written to the response.
>>>
>>> Next steps:
>>> - Wait for the issue to re-occur after the recommended changes above
>>>   and, depending on what is recorded in the access log for %b for a
>>>   failed request, shift the focus accordingly.
>>> - Answers to the additional questions would be nice but the access
>>>   log %b value for a failed request is the key piece of information
>>>   required at this point.
>>
>> Good news! I enabled that parameter a few days ago and we have
>> already caught some instances of the problem occurring.

Excellent!

>> Here is the logging format...
>>
>> <Valve className="org.apache.catalina.valves.AccessLogValve"
>>        directory="logs" prefix="localhost_access." suffix=".log"
>>        pattern="%h %l %D %u %t %{JSESSIONID}c %{cookie}i %r %s %b %S %q" />
>>
>> Due to some sensitive content in the HTTP requests below, I have
>> globally replaced certain words and addresses with random-ish strings,
>> but I don't think I've done anything to impact the issue.
>>
>> Following is an example from Wednesday.
>>
>> This is a request being sent from the nginx proxy to the first of 2
>> upstream servers, 10.51.14.46
>>
>> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream prematurely closed connection while reading response header from upstream, client: 99.88.77.66, server: redacted.domain.com, request: "GET /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656 HTTP/1.1", upstream: "http://10.51.14.46:3016/sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656", host: "redacted.domain.com"
>>
>> Here is the matching localhost_access log entry from that server….
>>
>> 10.51.14.133 - 144 - [21/Oct/2020:15:51:22 -0400] F405E25E49E3DCB81A36A87DED1FE573 JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573; srv_id=dea8d61a7d725e980a6093cb78d8ec73; JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573; srv_id=dea8d61a7d725e980a6093cb78d8ec73 GET /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656 HTTP/1.0 200 40423 F405E25E49E3DCB81A36A87DED1FE573 ?eDate=2020-10-21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID=0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.5650846&timestamp=21102020155122.472656
>>
>> Tomcat appears to think it sent 40423 bytes. However, even though it
>> shows an HTTP 200 response, Wireshark shows the response did not
>> actually go out on the wire.

That is very interesting. That narrows down where things might be going
wrong. It confirms that the application is generating the response, so
we can rule out a bug in that process.

By default there is some buffering here. The buffer size is 9000 bytes,
so with a response of 40423 bytes I'd expect at least some of those
bytes to be written to the network. However, Wireshark is reporting no
bytes written.

At the moment, I see the following options:

1. Tomcat bug causing the writes to be made to the wrong response.
2. Application bug (retaining references to request/response/writer
   etc.) causing the writes to be made to the wrong response.
3. JVM bug causing network writes to be dropped.
4. OS bug causing network writes to be dropped before they reach the
   point where they are observable by Wireshark.

To be frank, none of those options seem particularly likely.

At this point I'd echo Konstantin's recommendation to add the following
system property:

org.apache.catalina.connector.RECYCLE_FACADES=true

You'd normally do this in $CATALINA_HOME/bin/setenv.sh (creating that
file if necessary) with a line like:

CATALINA_OPTS="$CATALINA_OPTS -Dorg.apache.catalina.connector.RECYCLE_FACADES=true"

You can confirm that the setting has been applied by looking in the log
for the start-up. You should see something like:

Oct 26, 2020 10:18:45 AM org.apache.catalina.startup.VersionLoggerListener log
INFO: Command line argument: -Dorg.apache.catalina.connector.RECYCLE_FACADES=true

That option reduces the re-use of request, response and related objects
between requests and, if an application is retaining references it
shouldn't, you usually see a bunch of NullPointerExceptions in the logs
when the application tries to re-use those objects.
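Purely for illustration, the kind of coding error I have in mind looks
something like the sketch below. The servlet and field names are
invented (they are not from Tomcat or, as far as I know, from your
application); it just shows a response-scoped object being kept in a
field that outlives the request:

// Hypothetical example only - not taken from the real application.
import java.io.IOException;
import java.io.PrintWriter;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LeakyWriterServlet extends HttpServlet {

    // BUG: a per-response object stored in a field that outlives the request.
    private static PrintWriter cachedWriter;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/plain");
        if (cachedWriter == null) {
            // Reference retained after this request completes.
            cachedWriter = resp.getWriter();
        }
        // Later requests write through the stale reference instead of
        // their own response's writer.
        cachedWriter.println("widget data");
        cachedWriter.flush();
    }
}

With RECYCLE_FACADES=true the cached writer is cleared when the first
request is recycled, so code like that should fail quickly with a
NullPointerException instead of silently writing into whichever
response happens to own the recycled objects - which is exactly the
sort of log noise that would point us at option 2 above.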
Meanwhile, I'm going to work on a custom patch for 7.0.72 to add some
additional logging around the network writes.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org