RE: mod_jk Problems - - worker went to error state and dont recover

luke.walshe Thu, 21 Feb 2008 01:32:41 -0800

Thanks, I did try to unsubscribe but I kept getting them. Will try the
address below.


Luke Walshe
BT Operate, HGIPCC Technical Specialist
Telephone: +44 (0)1314483482, Email: [EMAIL PROTECTED] 


-----Original Message-----
From: Rainer Jung [mailto:[EMAIL PROTECTED] 
Sent: 21 February 2008 09:30
To: Tomcat Users List
Subject: Re: mod_jk Problems - - worker went to error state and dont
recover

See the footer of any mail on the list:

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


[EMAIL PROTECTED] wrote:
> All
> 
> Apologies, this is unrelated. How do I unsubscribe from this mailing
> list? I thought it would be useful and small, but it's overwhelming my
> inbox.
> 
> Thanks in Advance.
> 
> Luke Walshe
> BT Operate, HGIPCC Technical Specialist
> Telephone: +44 (0)1314483482, Email: [EMAIL PROTECTED] 
> 
> -----Original Message-----
> From: Ahmed Musa [mailto:[EMAIL PROTECTED] 
> Sent: 21 February 2008 09:25
> To: Tomcat Users List
> Subject: Re: mod_jk Problems - - worker went to error state and dont
> recover
> 
> Hello Rainer,
> Thanks for the information - the situation is clearer now.
> I will reread the docs, following your links, and run further
> tests with the improved logging.
> Thanks a lot for your time.
> With best regards,
> ahmed
> 
> -------- Original Message --------
>> Date: Wed, 20 Feb 2008 18:59:01 +0100
>> From: Rainer Jung <[EMAIL PROTECTED]>
>> To: Tomcat Users List <users@tomcat.apache.org>
>> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
> 
>> Ahmed Musa wrote:
>>> Hello,
>>> Wow - thank you very much, Rainer, for your very quick and informative
>>> answer.
>>> I will go to 1.2.26 and think about some "smoother" values for
>>> reply_timeout and max_reply_timeouts.
>>> I will search for the requests that cause the problems, because I
>>> already log the response time in the way you mentioned, but I am not
>>> sure that the user requests are responsible for the situation.
>>
>> One note: for Apache httpd 2.x %D is microseconds (there is no format
>> for milliseconds), for Tomcat %D is milliseconds. As long as you are
>> searching for the root cause, it might make sense to have both access
>> logs active to check for duration differences.
>>
>>> So one further question - does mod_jk itself check whether the backend
>>> is reachable, without user requests?
>>
>> No. Everything only works on top of user requests.
>>
>>> When there are connections to the backend, are they closed after the
>>> response or are they held open for further requests?
>>
>> In general they are held open. There are parameters for how long they
>> are held open without more requests before they get shut down, and also
>> how many might be kept open even when no requests are coming in. Those
>> are the connection pool parameters, which you will find on
>>
>> http://tomcat.apache.org/connectors-doc/reference/workers.html
>>
>> Tomcat also has a connectionTimeout on the connector, which will shut
>> down a connection from the Tomcat side if it is idle for too long.
>>
>> If you don't want to reuse connections at all, there's also a setting
>> (a JkOption in Apache).
>>
>>> Is it possible that the Checkpoint firewall in between is
>>> responsible for the connectivity problem?
>>
>> It can cut a connection that's idle for too long. Since you have
>> cping/cpong active via connect_timeout and prepost_timeout, you should
>> get a cping error message if the connection was dropped by the firewall
>> during idle times and mod_jk tries to use it again. The reply timeout in
>> the error log indicates that the backend isn't answering. Of course, if
>> it takes *very* long to answer, it might be that the firewall dropped
>> the connection in between, but then the root cause would still be the
>> long response time of the backend.
>>
>>> Another point is the worker "not recovering". Yes, you are right - in
>>> this situation I have many reply_timeouts, but these happen within a
>>> period of time, for example 30 minutes, and the worker is still dead
>>> even when there are no more reply_timeouts. It remains dead.
>>> It was necessary to restart it manually via jkstatus.
>>
>> I assume you are using stickiness, so when a session started on a node,
>> it will stay there. So when a worker is in error for a long time, all
>> new sessions will start on other nodes. If the worker is ready for
>> recovery, it needs a request that doesn't carry a session to get probed
>> with this request.
>>
>> In jkstatus, the status of an error worker should switch to REC when
>> mod_jk decides that it could send a non-sticky request there (to probe),
>> to PRB during the time this request is on the node, and finally
>> either to OK or back to ERR depending on the result of the request.
>>
>> You can log the number of errors (and accesses) that happened on the
>> node in the httpd access log. If you think that the node simply stays in
>> error for a long time, then the error count (and access count) should
>> stay constant. I would expect that they do not.
>>
>> Have a look at how LogFormat in Apache httpd works, and then add some of
>> those documented in
>>
>> http://tomcat.apache.org/connectors-doc/reference/apache.html
>>
>> like:
>>
>> JK_LB_LAST_NAME
>> JK_LB_LAST_ACCESSED
>> JK_LB_LAST_ERRORS
>> JK_LB_LAST_BUSY
>> JK_LB_LAST_STATE
>>
>> using the syntax %{JK_LB_LAST_STATE}n etc.
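
One way to check whether the error and access counters really stay constant is sketched below in Python. It assumes the LogFormat ends with the notes tagged as jk_name=, jk_acc=, jk_err= and jk_state=; those tags, the function names and the sample values are illustrative choices, not anything mod_jk prescribes.

```python
import re
from collections import defaultdict

# Assumption: the httpd LogFormat ends with these hypothetical tags:
#   jk_name=%{JK_LB_LAST_NAME}n jk_acc=%{JK_LB_LAST_ACCESSED}n
#   jk_err=%{JK_LB_LAST_ERRORS}n jk_state=%{JK_LB_LAST_STATE}n
JK_FIELDS = re.compile(
    r"jk_name=(?P<name>\S+) jk_acc=(?P<acc>\d+) "
    r"jk_err=(?P<err>\d+) jk_state=(?P<state>\S+)"
)


def counter_ranges(lines):
    """Return {worker: (min_acc, max_acc, min_err, max_err)} for the given log lines."""
    seen = defaultdict(list)
    for line in lines:
        match = JK_FIELDS.search(line)
        if match:
            seen[match["name"]].append((int(match["acc"]), int(match["err"])))
    ranges = {}
    for name, pairs in seen.items():
        accs = [acc for acc, _ in pairs]
        errs = [err for _, err in pairs]
        ranges[name] = (min(accs), max(accs), min(errs), max(errs))
    return ranges


def counters_moved(ranges, worker):
    """True if the access or error counter of a worker changed in the sample."""
    lo_acc, hi_acc, lo_err, hi_err = ranges[worker]
    return hi_acc > lo_acc or hi_err > lo_err


if __name__ == "__main__":
    # Invented sample lines, only to show the expected shape of the input.
    sample = [
        '"GET /a HTTP/1.1" 504 0 jk_name=INETP1011 jk_acc=100 jk_err=3 jk_state=ERR',
        '"GET /b HTTP/1.1" 504 0 jk_name=INETP1011 jk_acc=101 jk_err=4 jk_state=ERR',
    ]
    print(counters_moved(counter_ranges(sample), "INETP1011"))
```

If the counters of a worker move while jkstatus shows it in error, the worker is being probed and failing again rather than sitting idle.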
>>
>>> Another point is the learning. I read the docs, that is the information
>>> on the Apache website; I didn't find any others - are there other ones?
>>> They do not go into depth. If you read the spec and watch the logs, it
>>> is (for me) very hard to match things up. I understand the many ways
>>> mod_jk has to check whether there is a connection to the backend, but
>>> checking the reality in an error situation is very hard. By matching I
>>> mean "which part of the communication sequence failed, why, and which
>>> error message does it cause".
>>> But I will try - and study the mailing list as well.
>>
>> It's hard for us too (sometimes).
>>
>>> Thank you for your time - tomorrow we will have the new version and
>>> will see what happens.
>>> best
>>> ahmed
>>
>> Regards,
>>
>> Rainer
>>
>>> -------- Original Message --------
>>>> Date: Wed, 20 Feb 2008 15:56:42 +0100
>>>> From: Rainer Jung <[EMAIL PROTECTED]>
>>>> To: Tomcat Users List <users@tomcat.apache.org>
>>>> Subject: Re: mod_jk Problems - - worker went to error state and dont recover
>>>> [EMAIL PROTECTED] wrote:
>>>>> See Thread at: http://www.techienuggets.com/Detail?tx=25608 Posted on
>>>>> behalf of a User
>>>>> Hello all, after long unsuccessful research I hope someone can
>>>>> give me a hint on the following problems.
>>>>>
>>>>> Our Apache/mod_jk/Tomcat infrastructure ran without problems
>>>>> for about one year; then, two months ago, mod_jk errors started to
>>>>> occur. We upgraded the mod_jk version and made improvements in
>>>>> worker.properties. The problems changed and became less frequent, but
>>>>> they still appear sometimes.
>>>>>
>>>>> It seems that the mod_jk workers lose the connection to their
>>>>> Tomcat backend servers; there are messages in the mod_jk log files
>>>>> that point in this direction. Normally this does not seem to be a big
>>>>> problem, but under certain conditions (which?) the worker goes into an
>>>>> error state and cannot recover by itself; it must be recovered manually.
>>>>>
>>>>> Problem 1: The Tomcats are reachable - it is unknown why the workers
>>>>> think the server is dead.
>>>>> Problem 2: I have no idea why the worker goes into an error state and
>>>>> cannot recover.
>>>>
>>>> 2 is a consequence of 1
>>>>
>>>>> Problem 3: I miss explanations of the logged messages. I read the
>>>>> messages but cannot match them to the situation - when does a worker
>>>>> post these messages?
>>>>
>>>> 1 is a consequence of these messages
>>>>
>>>>> [Wed Feb 20 10:04:01.889 2008] [19237:3086010048] [info] jk_handler::mod_jk.c (2270): Aborting connection for worker=ajp_ggi
>>>>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_get_reply::jk_ajp_common.c (1623): (INETP1011) Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems (errno=110)
>>>>> [Wed Feb 20 10:04:39.799 2008] [19294:3086010048] [error] ajp_service::jk_ajp_common.c (2034): (INETP1011) receiving reply from tomcat failed with out recovery in send loop attempt=0
>>>>> [Wed Feb 20 10:04:41.799 2008] [19294:3086010048] [error] service::jk_lb_worker.c (1105): unrecoverable error 504, request failed. Tomcat failed in the middle of request, we can't recover to another instance.
>>>>
>>>> The second line tells us that your configured reply_timeout fired.
>>>> You set it to 120000 (2 minutes), so there are requests taking longer
>>>> than 2 minutes on the backend before the first response packet comes
>>>> back from the backend.
>>>>
>>>> With your configuration mod_jk then doesn't wait any longer for the
>>>> reply *and puts the backend into error mode*.
>>>>
>>>> Up until version 1.2.25, if you use a reply_timeout, you need to set it
>>>> to a high number which justifies the reasoning "if it takes that long,
>>>> then something is wrong with the backend".
>>>>
>>>> Reality shows: there is no such number. Often there are a few requests
>>>> that take unacceptably long on the backend *although* the backend is
>>>> still working.
>>>>
>>>> So in 1.2.25 we added max_reply_timeouts. With this set in addition to
>>>> reply_timeout, mod_jk will abort waiting for a reply after
>>>> reply_timeout, but allow some timeouts before actually deciding to put
>>>> the backend into error.
>>>>
>>>> Unfortunately the implementation of max_reply_timeouts in 1.2.25 was
>>>> wrong, so you need to go to 1.2.26 to get it working right.
>>>>
>>>> See:
>>>>
>>>> http://issues.apache.org/bugzilla/show_bug.cgi?id=43229
>>>>
>>>> Caution: this does *not* explain why the backends are not automatically
>>>> recovered after a minute of error condition. Maybe you have times where
>>>> you get too many of those reply_timeouts (see log file), and although we
>>>> recover after a minute, the backend almost immediately goes back into
>>>> error status.
>>>>
>>>>> -> Which timeout? How does mod_jk decide that Tomcat is down? Where
>>>>> can I find details on errno=110?
>>>>
>>>> reply_timeout, see above and also
>>>>
>>>> http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
>>>>
>>>> errno: a standard unix feature. The numbers are platform dependent. I
>>>> would assume in your case
>>>>
>>>> ETIMEDOUT       110     /* Connection timed out */
>>>>
>>>> so no wonder, that's exactly what we expect (and it doesn't tell us the
>>>> reason, i.e. what's wrong on the *backend* taking that long for a
>>>> response).
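
Because errno numbers are platform dependent, it can help to look the value up on the machine that wrote the log. A minimal Python sketch using only the standard library follows; the helper name describe_errno is made up for this example.

```python
import errno
import os


def describe_errno(number):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numbers to names; unknown numbers get a placeholder.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # Run this on the httpd host: per the reply above, 110 is expected to be
    # ETIMEDOUT there, but other platforms use different numbers.
    print(describe_errno(110))
    print(errno.ETIMEDOUT, describe_errno(errno.ETIMEDOUT))
```

The lookup only names the error; it says nothing about why the backend took that long to answer.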
>>>>
>>>>> -> receiving reply from tomcat failed with out recovery in send loop
>>>>> attempt=0 - what does "with out recovery in send loop" mean?
>>>>
>>>> That your configuration doesn't allow us to send the request to another
>>>> backend. recovery_options 7 includes: if mod_jk was able to send the
>>>> request to a backend, do not try to send it to another backend in case
>>>> of an error during the response handling. Even if you allowed
>>>> sending to another backend, it would not help with *not* putting the
>>>> worker into error state. More likely you would put all
>>>> workers into error state, because all of them might run into the same
>>>> timeout, one after the other.
>>>>
>>>>> -> unrecoverable error 504 - details on this error?
>>>>
>>>> That's simply how we return the situation back to the client (browser).
>>>>
>>>>> OK - I turned the logging level to debug. The course of events gets
>>>>> clearer, but more questions appear as well. There are socket numbers -
>>>>> which sockets, and what are these numbers, e.g. "will be shutting down
>>>>> socket 35 for worker INETP1021"? What are the sockets good for? How
>>>>> many are there per worker? Can I configure them?
>>>>
>>>> Should not be the problem here. For Apache httpd, if you do *not*
>>>> configure anything, we automatically choose the number of httpd threads
>>>> as the maximum number of connections. No need to change anything here.
>>>>
>>>>> => Generally - how can I solve such problems? I tried to look into
>>>>> the mod_jk code, searching for error codes and error messages, but
>>>>> could not find any relevant information. I am studying the log files
>>>>> but can't find out what really happens.
>>>>
>>>> Post to the list. Improve our docs.
>>>>
>>>> The error message contains the words "timeout" and "reply", and you
>>>> have a "reply_timeout".
>>>>
>>>> Long running requests are a frequent problem. If you want to get rid of
>>>> them, start by adding response times to your httpd and your tomcat
>>>> access log format (%D). Then have a look at which URLs are producing
>>>> long running requests, during what time of day they are happening etc.
>>>> This might give you a clue about the reasons.
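
As a rough illustration of that search, the Python sketch below groups slow requests by URL. It assumes an httpd access log in common log format with %D appended as the last field (microseconds in httpd 2.x); the regular expression, the function name slow_requests and the 10 second threshold are choices made for this example only.

```python
import re
from collections import defaultdict

# Assumption: common log format with %D (microseconds) as the final field, e.g.
#   LogFormat "%h %l %u %t \"%r\" %>s %b %D"
LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r"\d{3} \S+ (?P<usec>\d+)$"
)


def slow_requests(lines, threshold_seconds=10.0):
    """Return [(path, [(timestamp, seconds), ...])], slowest path first."""
    slow = defaultdict(list)
    for line in lines:
        match = LINE.search(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        seconds = int(match["usec"]) / 1_000_000
        if seconds >= threshold_seconds:
            path = match["url"].split("?", 1)[0]  # drop the query string
            slow[path].append((match["ts"], seconds))
    return sorted(
        slow.items(),
        key=lambda item: max(sec for _, sec in item[1]),
        reverse=True,
    )


if __name__ == "__main__":
    # Invented sample lines, only to show the expected input shape.
    sample = [
        '10.0.0.1 - - [20/Feb/2008:10:02:39 +0100] "GET /swl/report?id=1 HTTP/1.1" 504 312 122000000',
        '10.0.0.2 - - [20/Feb/2008:10:03:00 +0100] "GET /swl/index.jsp HTTP/1.1" 200 5120 85000',
    ]
    for path, hits in slow_requests(sample):
        print(path, hits)
```

For a Tomcat access log, where %D is in milliseconds, the divisor would be 1000 instead of 1,000,000.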
>>>>
>>>> And if they are very frequent: do Java Thread Dumps of your backends
>>>> and analyze them.
>>>>
>>>>> So - maybe someone has an idea why the worker thinks that the
>>>>> corresponding Tomcat is dead, and why it will not recover by itself.
>>>>
>>>> Tomcat is dead: from the point of view of mod_jk it simply means: we
>>>> didn't get an answer when we expected one. Details depend on the
>>>> additional log lines (could not connect, reply timeout etc.).
>>>>
>>>>> And I am also searching for tips on how I can help myself, and where
>>>>> to find something about the error codes and messages in mod_jk.
>>>>>
>>>>> thanks for your attention
>>>>> Best
>>>>> ahmed musa (writing from vienna)
>>>>>
>>>> Regards,
>>>>
>>>> Rainer
>>>>
>>>>> Current infrastructure
>>>>> We have 3 Apache web servers (2.2.6) based on CentOS release 4.3 /
>>>>> kernel version 2.6.9-34.
>>>>> In front of the web servers there are two hardware load balancers (at
>>>>> two locations), but they play no role in this story.
>>>>> The web servers are hosted at our ISP.
>>>>>
>>>>> The web servers balance the requests via mod_jk (version 1.2.25) for
>>>>> approx. 10 webapps to 18 backend Tomcat servers (blade servers; because
>>>>> of underlying application parts the OS is Windows 2003 Server - a long
>>>>> story not worth explaining :-) ). The Tomcat servers get data via
>>>>> requests against DB2 servers/DB2 databases on the mainframe. The
>>>>> Tomcat servers are in-house and were rebooted nightly because of
>>>>> automated deployment processes.
>>>>>
>>>>> Between the web servers and the Tomcat servers is a Checkpoint firewall.
>>>>> All webapps are deployed on all Tomcats - only mod_jk directs the
>>>>> requests to certain Tomcat instances.
>>>>> (On one blade server there are two identical Tomcat instances
>>>>> running.)
>>>>>
>>>>> Versions: Tomcat - 5.5.17_11, JDK 1.5.0_11-b03. The requests against
>>>>> the public website(s) are normal short-lived requests, and not many.
>>>>> Most webapps (portals) need a login and have a strong focus on business
>>>>> logic, so the instances are big (many MBs in RAM), the sessions are
>>>>> sticky and the session timeout is 20 minutes. But they also get fewer
>>>>> requests. In addition to the user requests, there are monitoring
>>>>> requests from our ISP.
>>>>> The problems appear on servers/portals with very few user requests.
>>>>> worker.properties
>>>>> worker.list=ajp_bam,ajp_ggi,ajp_ad,ajp_svp,.......,jkstatus
>>>>>
>>>>> worker.template.type=ajp13
>>>>> worker.template.lbfactor=5
>>>>> worker.template.socket_keepalive=1
>>>>> worker.template.connect_timeout=7000
>>>>> worker.template.prepost_timeout=5000
>>>>> worker.template.reply_timeout=120000
>>>>> worker.template.retries=6
>>>>> worker.template.activation=Active
>>>>> worker.template.recovery_options=7
>>>>>
>>>>> worker.lbtemplate.type=lb
>>>>> worker.lbtemplate.max_reply_timeouts=6
>>>>> worker.lbtemplate.method=Session
>>>>>
>>>>> #Produktions Worker
>>>>> # AS-INETP101 - 106 - 6/6 GGI
>>>>> worker.INETP1011.host=AS-INETP101.AEAT.ALLIANZ.AT
>>>>> worker.INETP1011.port=65001
>>>>> worker.INETP1011.reference=worker.template
>>>>>
>>>>> ....many more of the same
>>>>>
>>>>> then
>>>>>
>>>>> worker.ajp_ad.reference=worker.lbtemplate
>>>>> worker.ajp_ad.balance_workers=INETP1032,INETP1062
>>>>>
>>>>> .... many more portals
>>>>>
>>>>> and lastly jkstatus
>>>>>
>>>>> The JKMount is very simple
>>>>> JkMount /* ajp_ad    --- for the other portals mostly the same
>>>>>
>>>>> The Portals are Virtual Hosts on the Apache.
>>>>>
>>>>> Tomcat - server.xml
>>>>> example
>>>>> <Connector port="65001" maxThreads="300" protocol="AJP/1.3" />
>>>>>     <Engine name="Catalina" jvmRoute="INETP5021" defaultHost="default">
>>>>> ......
>>>>> <Host name="slfinsol.com" appBase="webapps" unpackWARs="true"
>>>>> autoDeploy="false" deployOnStartup="false" xmlValidation="false"
>>>>> xmlNamespaceAware="false">
>>>>>         <Alias>www.slfinsol.com</Alias>
>>>>>         <Alias>web1.slfinsol.com</Alias>
>>>>>         ...
>>>>>         <Alias>testweb.slfinsol.com</Alias>
>>>>>         .....
>>>>>         <Valve className="org.apache.catalina.valves.AccessLogValve"
>>>>> directory="logs" prefix="swl_access_log." suffix=".txt" pattern="common"
>>>>> resolveHosts="false" />
>>>>>         <Valve className="at.allianz.tomcat.valve.RequestTimeValve"/>
>>>>>         <Valve className="at.allianz.tomcat.valve.WebcollaborationWorkaroundValve"/>
>>>>>         <Context path="" docBase="swl" />
>>>>>         <Context path="/monitor5" docBase="monitor" />
>>>>>         <Context path="/swl" docBase="swl" />
>>>>>       </Host>    

