On 26/10/2020 10:26, Mark Thomas wrote:
> On 24/10/2020 01:32, Eric Robinson wrote:
> 
> <snip/>
> 
>>>> -----Original Message-----
>>>> From: Mark Thomas <ma...@apache.org>
> 
> <snip/>
> 
>>>> The failed request:
>>>> - Completes in ~6ms
>>>
>>> I think we've seen the failed requests take as much as 50ms.
> 
> Ack. That is still orders of magnitude smaller than the timeout and
> consistent with the generation time of some of the larger responses.
> 
> I wouldn't say it confirms any of my previous conclusions, but it doesn't
> invalidate them either.
> 
>>>> Follow-up questions:
>>>> - JVM
>>>>   - Vendor?
>>>>   - OS package or direct from Vendor?
>>>>
>>>
>>> jdk-8u221-linux-x64.tar.gz downloaded from the Oracle web site.
> 
> OK. That is post Java 8u202, so it should be a paid-for, commercially
> supported version of Java 8.
> 
> The latest Java 8 release from Oracle is 8u271.
> 
> The latest Java 8 release from AdoptOpenJDK is 8u272.
> 
> I don't think we are quite at that point yet, but what is your view on
> updating to the latest Java 8 JDK (from either Oracle or AdoptOpenJDK)?
> 
>>>> - Tomcat
>>>>   - OS package, 3rd-party package or direct from ASF?
>>>>
>>>
>>> tomcat.noarch  7.0.76-6.el7 from CentOS base repository
>>>
>>
>> Drat, slight correction. I now recall that although we initially installed 
>> 7.0.76 from the CentOS repo, the application vendor made us lower the 
>> version to 7.0.72, and I DO NOT know where we got that. However, it has not 
>> changed since October-ish of 2018.
> 
> I've reviewed the 7.0.72 to 7.0.76 changelog and I don't see any
> relevant changes.
> 
>>>> - Config
>>>>   - Any changes at all around the time the problems started? I'm
>>>>     thinking OS updates, VM restarted etc?
>>>>
>>>
>>> server.xml has not changed since 4/20/20, which was well before the
>>> problem manifested, and all the other files in the conf folder are even 
>>> older
>>> than that. We're seeing this symptom on both production servers. One of
>>> them was rebooted a week ago, but the other has been up continuously for
>>> 258 days.
> 
> OK. That rules a few things out, which is good, but it does make the
> trigger for this issue even more mysterious.
> 
> Any changes in the Nginx configuration in the relevant timescale?
> 
> Any updates to the application in the relevant timescale?
> 
> Any features users started using that hadn't been used before in that
> timescale?
> 
> <snip/>
> 
>>>> Recommendations:
>>>> - Switch back to the BIO connector if you haven't already. It has fewer
>>>>   moving parts than NIO, so it is simpler to debug.
>>>> - Add "%b" to the access log pattern for Tomcat's access log valve to
>>>>   record the number of body (excluding headers) bytes Tomcat believes it
>>>>   has written to the response.
>>>>
>>>>
>>>> Next steps:
>>>> - Wait for the issue to re-occur after the recommended changes above and,
>>>>   depending on what is recorded in the access log for %b for a failed
>>>>   request, shift the focus accordingly.
>>>> - Answers to the additional questions would be nice, but the access log
>>>>   %b value for a failed request is the key piece of information required
>>>>   at this point.
>>>>
>>>
>>> Good news! I enabled that parameter a few days ago and we have already
>>> caught some instances of the problem occurring.
> 
> Excellent!
> 
>>> Here is the logging format...
>>>
>>> <Valve className="org.apache.catalina.valves.AccessLogValve"
>>>        directory="logs" prefix="localhost_access." suffix=".log"
>>>        pattern="%h %l %D %u %t %{JSESSIONID}c %{cookie}i %r %s %b %S %q" />
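>>>
>>> (For reference: in that pattern %D is the request processing time in
>>> milliseconds, %s the status, %b the bytes sent excluding headers, %S the
>>> session ID and %q the query string, so in the entry below the "200 40423"
>>> near the end is the status followed by the %b byte count.)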
>>>
>>> Due to some sensitive content in the HTTP requests below, I have globally
>>> replaced certain words and addresses with random-ish strings, but I don't
>>> think I've done anything to impact the issue.
>>>
>>> Following is an example from Wednesday.
>>>
>>> This is a request being sent from the nginx proxy to the first of 2 upstream
>>> servers, 10.51.14.46
>>>
>>> 2020/10/21 15:51:22 [error] 39268#39268: *842342531 upstream prematurely
>>> closed connection while reading response header from upstream, client:
>>> 99.88.77.66, server: redacted.domain.com, request: "GET
>>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-
>>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>>> 5650846&timestamp=21102020155122.472656 HTTP/1.1", upstream:
>>> "http://10.51.14.46:3016/sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=20
>>> 20-10-
>>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>>> 5650846&timestamp=21102020155122.472656", host:
>>> "redacted.domain.com"
>>>
>>> Here is the matching localhost_access log entry from that server...
>>>
>>> 10.51.14.133 - 144 - [21/Oct/2020:15:51:22 -0400]
>>> F405E25E49E3DCB81A36A87DED1FE573
>>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573;
>>> srv_id=dea8d61a7d725e980a6093cb78d8ec73;
>>> JSESSIONID=F405E25E49E3DCB81A36A87DED1FE573;
>>> srv_id=dea8d61a7d725e980a6093cb78d8ec73 GET
>>> /sandhut/jsp/catalog/xml/getWidgets.jsp?eDate=2020-10-
>>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>>> 5650846&timestamp=21102020155122.472656 HTTP/1.0 200 40423
>>> F405E25E49E3DCB81A36A87DED1FE573 ?eDate=2020-10-
>>> 21&statueId=64438&facilityId=0&DeptId=0&CompressXML=Yes&sessionDID
>>> =0&GzUserId=75064&gizusrlogtblid=322095&gizappprocessid=8568&rnd2=0.
>>> 5650846&timestamp=21102020155122.472656
>>>
>>> Tomcat appears to think it sent 40423 bytes. However, even though it shows
>>> an HTTP 200 response, Wireshark shows the response did not actually go out
>>> on the wire.
> 
> That is very interesting. That narrows down where things might be going
> wrong.
> 
> It confirms that the application is generating the response so we can
> rule out a bug in that process.
> 
> By default there is some buffering here. The buffer size is 9000 bytes.
> So with a response of 40423 bytes I'd expect at least some of those
> bytes to be written to the network. However, Wireshark is reporting no
> bytes written.
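> 
> As a rough sanity check: 40423 bytes at a 9000 byte buffer size is four
> full buffers plus a 4423 byte remainder, so several socket writes would
> normally be expected for this response even before it completes.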
> 
> 
> At the moment, I see the following options:
> 
> 1. Tomcat bug causing the writes to be made to the wrong response.
> 
> 2. Application bug (retaining references to request/response/writer
>    etc.) causing the writes to be made to the wrong response.
> 
> 3. JVM bug causing network writes to be dropped.
> 
> 4. OS bug causing network writes to be dropped before they reach the
>    point where they are observable by Wireshark.
> 
> To be frank, none of those options seem particularly likely.
> 
> At this point I'd echo Konstantin's recommendation to add the following
> system property:
> org.apache.catalina.connector.RECYCLE_FACADES=true
> 
> You'd normally do this in $CATALINA_HOME/bin/setenv.sh (creating that
> file if necessary) with a line like:
> 
> CATALINA_OPTS="$CATALINA_OPTS
> -Dorg.apache.catalina.connector.RECYCLE_FACADES=true"
> 
> You can confirm that the setting has been applied by looking in the log
> for the start-up. You should see something like:
> 
> Oct 26, 2020 10:18:45 AM
> org.apache.catalina.startup.VersionLoggerListener log
> INFO: Command line argument:
> -Dorg.apache.catalina.connector.RECYCLE_FACADES=true
> 
> 
> That option reduces the re-use of request, response and related objects
> between requests and, if an application is retaining references it
> shouldn't, you usually see a bunch of NullPointerExceptions in the logs
> when the application tries to re-use those objects.
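> 
> For illustration only, here is a minimal sketch of the kind of retained
> reference that setting tends to expose (the servlet and field names are
> hypothetical, not taken from your application):
> 
> import java.io.IOException;
> import java.io.PrintWriter;
> import javax.servlet.http.HttpServlet;
> import javax.servlet.http.HttpServletRequest;
> import javax.servlet.http.HttpServletResponse;
> 
> public class WidgetServlet extends HttpServlet {
> 
>     // BUG: one servlet instance serves many requests, so caching the
>     // writer of one response in an instance field means a later request
>     // can end up writing to (or after the recycling of) an earlier
>     // request's response.
>     private PrintWriter cachedWriter;
> 
>     @Override
>     protected void doGet(HttpServletRequest req, HttpServletResponse resp)
>             throws IOException {
>         if (cachedWriter == null) {
>             cachedWriter = resp.getWriter();
>         }
>         cachedWriter.println("<widgets/>"); // may target the wrong response
>     }
> }
> 
> With RECYCLE_FACADES=true the facade objects are not reused, so code like
> this usually fails quickly with a NullPointerException rather than
> silently writing into another request's response.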
> 
> Meanwhile, I'm going to work on a custom patch for 7.0.72 to add some
> additional logging around the network writes.

Patch files and instructions for use:

http://home.apache.org/~markt/dev/v7.0.72-custom-patch-v1/

Mark

