Hi Jesse,
Jesse Klaasse schrieb:
I have migrated a customer's server system to the following
configuration:
MS IIS 6.0 (port 80)
Unlimited connections
Connection timeout: 120s keep-alive
JK 1.2.25 AJP/1.3 Connector
isapi_redirect.properties:
uri_select=parsed
workers.properties:
connect_timeout=10000
prepost_timeout=10000
socket_timeout=10
connection_pool_timeout=60
connection_pool_size=350
Apache Tomcat 5.5.20 (ports 8080 / 8009)
Apache Portable Runtime tcnative-1.dll 1.1.12 installed
Java options: -XX:MaxPermSize=512m -Xloggc:D:\logs\gc\tomcat-gc.log
-XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+UseTLAB -XX:+CMSClassUnloadingEnabled
-XX:+CMSPermGenSweepingEnabled -Xms4096m -Xmx5120m
server.xml connector 8009:
acceptCount="100" maxThreads="400" minSpareThreads="25"
maxSpareThreads="75" connectionTimeout="60000"
This is a dual-core Xeon machine running Windows Server 2003 x64 with 6
GB RAM and Sun Java JDK 1.5.
I have to reboot Tomcat every night, and even then I have to restart
Tomcat manually almost every day because of a "Service temporary
unavailable" message. When Tomcat is hanging, JK Status Manager shows a
Bad/Stopped worker. A reset of the worker doesn't fix the problem,
neither does a restart of IIS. Only restarting Tomcat fixes the problem.
Double check: The worker is a member of a load balancer. the member is
*not* in state STOP (because that is a configuration state) but in ERROR
(which is a runtime detected state).
If this state doesn't go away, we have a persistent comminucation
problem with Tomcat (which is what your log files confirm).
In order to analyze, why this is, we can check the state of Tomcat
itself, and we can think about the state of the communication between
IIS and Tomcat.
First: you don't use a reply_timeout? At this stage you shouldn't, just
want to make sure.
Regarding error codes: the two digit error codes XX (like 60 or 53)
corrspond to winsock error codes 100XX. Those are listed e.g. on
http://support.microsoft.com/kb/819124
53=10053=WSAECONNABORTED
60=10060=WSAETIMEDOUT
Now what are possible reasons, and how to analyze?
1) Check the internal status of Tomcat, when the problem shows up. For
this make a couple of thread dumps, e.g. three thread dumps each about 3
seconds apart from each other. Thread dumps contain the information, in
which method stack each Tomcat thread is. So we can see, how many Tomcat
threads actually are working on a request, and often we can only see,
what type of request, if they are waiting for database responses etc.
With this test we can check whether there is congestion inside Tomcat,
or if the problem lies in the communication layer between IIS and Tomcat.
How to do thread dumps: if Tomcat is running from a DOS box, you can use
CTRL-Break on the keyboard (and the dumps go directly to the DOS box),
if it is running as a service, there is an entry in the context menue of
the tomcat monitor icon (system tray), and the dumps go to the service
log file.
You can post the Dumps here, maybe it's better to open an issue, because
then you can attach the files to Bugzilla.
2) Assume, that Tomcat seems to be mostly idle, once the problem shows up.
Then we would need to understand the communication layer:
Use "netstat -an" on the IIS system and the Tomcat system (if they are
not the same) to produce a list of TCP connections and their state.
If possible use wireshark to produce a full packet dump of the
communications between the two for a minute or so, namely long enough,
that the cited log message occur a few times.
3) General suggestion ("shot in the dark")
You could do a simple test to quickly check:
- remove the socket_timeout
and
- remove the APR connector (tcnative)
If this solves the problem, check, if removing only of of them suffices.
If this quick test indicates APR connector as problematic, upgrade to
1.1.13 (or the soon to appear 1.1.14).
The isapi_redirect log contains a lot of these lines:
[Wed Jul 02 16:19:16.908 2008] [11604:13424] [info]
ajp_send_request::jk_ajp_common.c (1265): (ajp13w) all endpoints are
disconnected
[Wed Jul 02 16:19:16.908 2008] [11604:13424] [info]
ajp_service::jk_ajp_common.c (2085): (ajp13w) sending request to tomcat
failed, recoverable operation attempt=1
Those are only higher level messages, not indicating the root cause.
I have also tried the 1.2.26 connector. It seemed like it needed a
restart even sooner with that one. These lines were in
The log information in 1.2.26 should be more precise though. At least
for me ;)
isapi_redirect.log a lot of times with the 1.2.26 version:
[Wed Jul 02 15:44:51.040 2008] [11604:12396] [error]
ajp_connection_tcp_get_message::jk_ajp_common.c (966): (ajp13w) can't
receive the response message from tomcat, network problems or tomcat
(127.0.0.1:8009) is down (errno=60)
The line number tells me, that is still 1.2.25, but I assume you get the
same message text for 1.2.26.
Here I guess: since there was no reply_timeout set, the socket_timeout
fires after 10 seconds, aborts the wait and resets the connection. If
you can log response times with IIS, you could check, if they are above
10 seconds. You can also log response times with an appropriate
JkRequestLogFormat.
You could set a reply timeout to a huge value, like eg. 60 seconds, if
you think that even under load *all* requests should return in less than
60 seconds. We can optimize this setting later (with max_reply_timeouts
in 1.2.26).
You could try TCP tuning like in
http://support.microsoft.com/kb/191143
but I doubt, that this will resolve the root consequence.
[Wed Jul 02 15:44:51.040 2008] [11604:12396] [error]
ajp_get_reply::jk_ajp_common.c (1658): (ajp13w) Tomcat is down or
refused connection. No response has been sent to the client (yet)
[Wed Jul 02 15:44:51.040 2008] [11604:12396] [info]
ajp_service::jk_ajp_common.c (2046): (ajp13w) receiving from tomcat
failed, recoverable operation attempt=0
[Wed Jul 02 15:44:51.040 2008] [11604:12396] [info]
ajp_service::jk_ajp_common.c (2085): (ajp13w) sending request to tomcat
failed, recoverable operation attempt=1
Those are followup messages. Next line is for another thread ID, so it
comes from handling another request. It can't even connect, so I expect
all threads to be busy (see thread dump).
[Wed Jul 02 15:44:51.946 2008] [11604:13156] [error]
ajp_service::jk_ajp_common.c (2097): (ajp13w) Connecting to tomcat
failed. Tomcat is probably not started or is listening on the wrong port
[Wed Jul 02 15:44:51.962 2008] [11604:13156] [info]
service::jk_lb_worker.c (1128): service failed, worker ajp13w is in
error state
[Wed Jul 02 15:44:51.962 2008] [11604:13156] [info]
service::jk_lb_worker.c (1191): All tomcat instances are busy or in
error state
[Wed Jul 02 15:44:51.962 2008] [11604:13156] [error]
HttpExtensionProc::jk_isapi_plugin.c (1537): service() failed with http
error 503
Included from your other post:
Additional information:
When using version 1.2.26 of the connector, these messages appear in
isapi_redirect.log before the messages I quoted in my previous mail:
[Wed Jul 02 15:39:34.690 2008] [11800:10604] [info] service::jk_lb_worker.c
(1221): Forcing recovery once for 1 workers
[Wed Jul 02 15:39:34.690 2008] [11800:10604] [error]
ajp_connection_tcp_send_message::jk_ajp_common.c (965): sendfull returned
-53 (errno=53)
Aha, if this is really coming before the error "60", then you should aso
look at:
http://support.microsoft.com/kb/931319/
Unfortunately the description is not very technically precise.
[Wed Jul 02 15:39:34.690 2008] [11800:10604] [info]
ajp_handle_cping_cpong::jk_ajp_common.c (803): can't send cping query
[Wed Jul 02 15:39:34.690 2008] [11800:10604] [error]
ajp_connect_to_endpoint::jk_ajp_common.c (904): (ajp13w) cping/cpong after
connecting to the backend server failed (errno=53)
[Wed Jul 02 15:39:34.690 2008] [11800:10604] [error]
ajp_send_request::jk_ajp_common.c (1359): (ajp13w) connecting to backend
failed. Tomcat is probably not started or is listening on the wrong port
(errno=53)
[Wed Jul 02 15:39:34.690 2008] [11800:10604] [info]
ajp_service::jk_ajp_common.c (2186): (ajp13w) sending request to tomcat
failed (recoverable), because of error during request sending (attempt=1)
...
Nothing important seems to be in the stdout logs right before the
problems occur. Only when I was using 1.2.26 of the connector, lots of
these messages show up in the stdout:
Jul 2, 2008 1:14:17 PM org.apache.jk.common.ChannelSocket
processConnection
WARNING: processCallbacks status 2
Jul 2, 2008 1:14:17 PM org.apache.jk.core.MsgContext action
WARNING: Error sending end packet
java.net.SocketException: Software caused connection abort: socket write
error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at
java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at
java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at
org.apache.jk.common.ChannelSocket.send(ChannelSocket.java:531)
at
org.apache.jk.common.JkInputStream.endMessage(JkInputStream.java:112)
at org.apache.jk.core.MsgContext.action(MsgContext.java:293)
at org.apache.coyote.Response.action(Response.java:182)
at org.apache.coyote.Response.finish(Response.java:304)
at
org.apache.catalina.connector.OutputBuffer.close(OutputBuffer.java:281)
at
org.apache.catalina.connector.Response.finishResponse(Response.java:476)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:1
51)
at
org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:199)
at
org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:282)
at
org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:767)
at
org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:
697)
at
org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.
java:889)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool
.java:684)
at java.lang.Thread.run(Thread.java:595)
If I'm right above, and the socket_timeout closes the waiting for
response, then Tomcat can't send back the response, once the Timeout fired.
I'm starting to be pretty desperate, after almost two weeks of daily
reboots and an angry customer.
A suggestion, anyone? Thanks in advance.
Maybe too many suggestions and not a straight solution, but if you are
able to collect more information, we should be able to sort this out.
Do others have the same issue on Windows? Did they find a solution?
Regards,
Rainer
---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]