Hello Rainer,
First of all, thank you for your extensive answer and for the time you took
to write it. This really gives me hope.
Rainer Jung-3 wrote:
Double check: The worker is a member of a load balancer. The member is
*not* in state STOP (because that is a configuration state) but in ERROR
(which is a runtime detected state).
You are right, the worker is (the only) member of a load balancer. It is in
"OK" state until Tomcat hangs. Then it changes into "ERROR".
Rainer Jung-3 wrote:
First: you don't use a reply_timeout? At this stage you shouldn't; I just
want to make sure.
I haven't configured a reply_timeout.
Rainer Jung-3 wrote:
How to do thread dumps: if Tomcat is running from a DOS box, you can use
CTRL-Break on the keyboard (and the dumps go directly to the DOS box),
if it is running as a service, there is an entry in the context menu of
the tomcat monitor icon (system tray), and the dumps go to the service
log file.
Tomcat was hanging a few minutes ago, and I have created some thread dumps,
which are available in the uploaded ZIP file.
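To get a quick overview of the dumps, the thread states could be tallied with something like the following. This is only a rough sketch: it assumes the dumps contain "java.lang.Thread.State:" lines (not every JVM version prints them), and the sample text and the name count_thread_states are made up for illustration, not taken from my dumps.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: BLOCKED (on object monitor)"
STATE_RE = re.compile(r"java\.lang\.Thread\.State:\s+(\w+)")


def count_thread_states(dump_text):
    """Return a Counter mapping thread state name to number of threads."""
    return Counter(STATE_RE.findall(dump_text))


# Hypothetical two-thread excerpt, only to show the expected input shape.
sample_dump = """\
"ajp-8009-1" daemon prio=6 tid=0x01 nid=0x2 runnable [0x0]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)

"ajp-8009-2" daemon prio=6 tid=0x02 nid=0x3 waiting for monitor entry [0x0]
   java.lang.Thread.State: BLOCKED (on object monitor)
"""

if __name__ == "__main__":
    for state, n in count_thread_states(sample_dump).most_common():
        print(state, n)
```

A large number of BLOCKED or WAITING connector threads in the real dumps would be the first thing to look at.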
Rainer Jung-3 wrote:
Use "netstat -an" on the IIS system and the Tomcat system (if they are
not the same) to produce a list of TCP connections and their state.
For your information: Tomcat and IIS are on the same system. I have also
included a few netstat logs taken while Tomcat was hanging. They can also be
found in the attached zip file.
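For reading the netstat logs, a small script along these lines could summarize the TCP states of the connector connections. It is a sketch under assumptions: it assumes the AJP connector listens on port 8009 (Tomcat's default, adjust if configured differently), it expects the column layout of Windows "netstat -an", and the sample rows are invented, not copied from my logs.

```python
from collections import Counter

AJP_PORT = 8009  # assumption: default AJP port; change to the configured one


def count_states(netstat_text, port=AJP_PORT):
    """Count TCP states of connections whose local or remote end uses `port`."""
    counts = Counter()
    suffix = ":" + str(port)
    for line in netstat_text.splitlines():
        parts = line.split()
        # TCP rows look like: Proto, Local Address, Foreign Address, State
        if len(parts) < 4 or parts[0].upper() != "TCP":
            continue
        local, remote, state = parts[1], parts[2], parts[3]
        if local.endswith(suffix) or remote.endswith(suffix):
            counts[state] += 1
    return counts


# Hypothetical excerpt, only to show the expected input shape.
sample = """\
  TCP    127.0.0.1:8009     127.0.0.1:49210    ESTABLISHED
  TCP    127.0.0.1:49210    127.0.0.1:8009     ESTABLISHED
  TCP    127.0.0.1:49211    127.0.0.1:8009     CLOSE_WAIT
  TCP    0.0.0.0:80         0.0.0.0:0          LISTENING
"""

if __name__ == "__main__":
    for state, n in count_states(sample).most_common():
        print(state, n)
```

Because IIS and Tomcat share one machine, each connection shows up twice (once from each side), so the counts should be read with that in mind.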
Rainer Jung-3 wrote:
If possible use wireshark to produce a full packet dump of the
communications between the two for a minute or so, i.e. long enough
that the cited log message occurs a few times.
I have downloaded and installed Wireshark. I have included a few minutes of
Wireshark captured data in the zip file too.
Rainer Jung-3 wrote:
- remove the socket_timeout
and
- remove the APR connector (tcnative)
If this solves the problem, check whether removing only one of them suffices.
If this quick test indicates APR connector as problematic, upgrade to
1.1.13 (or the soon to appear 1.1.14).
I have already tried to remove the APR connector, but this was really not a
good idea. Without APR, Tomcat hung after only one hour of normal use. With
APR, it lasts for about half a day. During the last hang, I downgraded APR
to 1.1.10, which we were using before 1.1.12, and which seems to be a little
more stable. I haven't been able to find 1.1.13 for Windows x64. Is it
available? I tried the http://tomcat.heanet.ie/native/ link.
Should I really try to remove the socket_timeout? Should I try this before
setting the reply_timeout to 60 seconds, as you state later in your mail?
Rainer Jung-3 wrote:
The log information in 1.2.26 should be more precise though. At least
for me ;)
When we used 1.2.26, logging was more precise indeed. But it seemed to be
less stable, although I'm not sure if this has anything to do with the
connector version, since I also changed the tcnative version.
Rainer Jung-3 wrote:
Here I guess: since there was no reply_timeout set, the socket_timeout
fires after 10 seconds, aborts the wait and resets the connection. If
you can log response times with IIS, you could check whether they are above
10 seconds. You can also log response times with an appropriate
JkRequestLogFormat.
How should I set the JkRequestLogFormat? Isn't that an Apache (webserver)
directive? I am (and have to be - company policies) using IIS.
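In the meantime, the response-time check could perhaps be done from the IIS log itself. The following is a minimal sketch, assuming W3C extended logging with the time-taken field enabled (IIS records it in milliseconds). The function name slow_requests and the sample log lines are hypothetical.

```python
def slow_requests(log_lines, threshold_ms=10000):
    """Return (uri, time_taken_ms) for requests slower than threshold_ms.

    Column positions are read from the "#Fields:" directive, so the
    function adapts to whichever fields are enabled in the log.
    """
    fields = []
    slow = []
    for raw in log_lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        # Skip blanks, other directives, and logs without time-taken
        if not line or line.startswith("#") or "time-taken" not in fields:
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # malformed or truncated row
        row = dict(zip(fields, values))
        taken = int(row["time-taken"])
        if taken > threshold_ms:
            slow.append((row.get("cs-uri-stem", "?"), taken))
    return slow


# Hypothetical log excerpt, only to show the expected input shape.
sample_log = [
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2008-07-03 10:15:01 GET /app/index.jsp 200 312",
    "2008-07-03 10:15:20 GET /app/report.jsp 500 10015",
]

if __name__ == "__main__":
    for uri, ms in slow_requests(sample_log):
        print(uri, ms)
```

The 10000 ms default mirrors the 10 second socket_timeout mentioned above; requests just over that mark would support the theory that the socket_timeout is what aborts them.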
Rainer Jung-3 wrote:
You could set a reply timeout to a huge value, e.g. 60 seconds, if
you think that even under load *all* requests should return in less than
60 seconds. We can optimize this setting later (with max_reply_timeouts
in 1.2.26).
I will try this, but not yet. Not all at the same time :)
Rainer Jung-3 wrote:
You could try TCP tuning like in
http://support.microsoft.com/kb/191143
but I doubt that this will resolve the root cause.
This sounds unlikely to me too, so this will probably be a last resort.
Rainer Jung-3 wrote:
Aha, if this is really coming before the error "60", then you should also
look at:
http://support.microsoft.com/kb/931319/
Sounds like it could help. I have installed the hotfix, but the system
needs to be restarted in order to activate it (argh!), which is something I
can't just do when the traffic is high. Maybe I'll reboot the server
tonight.
Rainer Jung-3 wrote:
Maybe too many suggestions and not a straight solution, but if you are
able to collect more information, we should be able to sort this out.
I hope the logs/dumps will help you. I will also look into them myself.
Rainer Jung-3 wrote:
Do others have the same issue on Windows? Did they find a solution?
I have searched all over the web, and there is a lot of information about
this whole setup, but it's very fragmented and opinions vary widely.
Again, thank you very much for your help and time so far. I hope we will be
able to resolve this problem!
http://www.nabble.com/file/p18255109/20080703_tomcat_hang_dumps.zip