Hi Mathias,

I think we understand your use case; it is not uncommon at all.
In fact, as I mentioned a few times, it is the 'main' use
case for Apache ( multi-process ) when using the JNI worker.
In this case Apache acts as a 'natural' load-balancer, with 
requests going to various processes ( more or less randomly ).
As in your case, requests without a session should always go
to the worker that is in the same process.

The main reason for using '0' for the "local" worker is that
in jk2 I want to switch from float to int - there is no reason
( AFAIK ) to do all the float computation; even a short int
would be enough for the purpose of implementing a round-robin
with weights.
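
Just to make it concrete, here is a rough sketch ( not the real jk
code - the struct and function names are made up ) of how a weighted
round-robin can be done with plain ints: each worker accumulates its
weight, the one with the largest accumulated value is picked and then
'pays back' the total.

    /* sketch only - illustrates integer weighted round-robin */
    struct lb_sketch_worker {
        const char *name;
        int weight;       /* configured lbfactor, as an int */
        int current;      /* accumulated value, starts at 0 */
    };

    static struct lb_sketch_worker *
    pick_worker(struct lb_sketch_worker *w, int n)
    {
        int i, total = 0;
        struct lb_sketch_worker *best = NULL;

        for (i = 0; i < n; i++) {
            w[i].current += w[i].weight;
            total += w[i].weight;
            if (best == NULL || w[i].current > best->current)
                best = &w[i];
        }
        if (best != NULL)
            best->current -= total;   /* no floats anywhere */
        return best;
    }

Over time each worker gets picked in proportion to its weight, and
since the accumulated values stay bounded by the sum of the weights, a
short int really would be enough.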

BTW, one extension I'm trying to make is support for multiple
local workers - I'm still thinking about how to do that. This will
cover the case of a few big boxes, each with several Tomcat
instances ( if you have many GB of RAM and many processors, it is
sometimes better to run several VMs instead of a single large
process ). In this case you still want some remote Tomcats, for
failover, but most of the load should go to the local workers.
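
Something like this is what I have in mind ( the fragment below is
only an illustration - it uses the normal jk1 worker properties,
nothing new yet, and the hosts/ports are invented ): one big box
running two local Tomcats plus a remote instance for failover.

    # illustrative workers.properties fragment - values are made up
    worker.list=lb
    worker.lb.type=lb
    worker.lb.balanced_workers=local1,local2,remote1

    worker.local1.type=ajp13
    worker.local1.host=localhost
    worker.local1.port=8009

    worker.local2.type=ajp13
    worker.local2.host=localhost
    worker.local2.port=8010

    worker.remote1.type=ajp13
    worker.remote1.host=otherbox
    worker.remote1.port=8009

The open question is how to tell the lb worker that local1 and local2
should both be preferred for requests without a session.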

For jk2 I already fixed the selection of the 'recovering' worker:
after the timeout the worker will go through normal selection instead
of being automatically chosen.
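
In pseudo-C the idea is something like this ( a sketch of the intent,
not the actual jk2 source - field and constant names are invented ):

    /* sketch: a worker in error state is skipped until its recover
     * timeout expires; after that it simply becomes selectable again
     * and goes through normal selection like everybody else */
    #include <time.h>

    #define RECOVER_WAIT 60          /* seconds - made-up value */

    struct sketch_worker {
        int    in_error_state;
        time_t error_time;
        int    lb_value;
    };

    static int is_selectable(struct sketch_worker *w, time_t now)
    {
        if (!w->in_error_state)
            return 1;                 /* healthy - always a candidate */
        if (now - w->error_time < RECOVER_WAIT)
            return 0;                 /* still in its back-off window */
        w->in_error_state = 0;        /* give it another chance...    */
        return 1;                     /* ...but via normal selection  */
    }

The difference from the old behavior is that the recovering worker is
not returned directly; it only competes again on its lb_value.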

For jk1 - I'm waiting for patches :-) I wouldn't make a big change;
the current fix seemed like a good one.

I agree that changing the meaning of 0 may be confusing ( is it
documented? my workers.properties says it should never be used ).
We can fix that by using an additional flag - and not using 
special values.
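
Something as simple as an explicit field on the worker record would do
( again just a sketch, the field name is invented ):

    /* sketch of the 'additional flag' idea - instead of overloading
     * lbfactor==0, mark local workers explicitly so the weight keeps
     * its normal meaning */
    struct worker_record_sketch {
        char *name;
        int   lb_factor;   /* plain weight, 0 no longer special */
        int   is_local;    /* 1 = same box/process, preferred for
                              requests that carry no session */
        int   in_error_state;
    };

plus a matching property in workers.properties to set it.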

Another special note - jk2 will also support 'graceful shutdown',
which means your case ( replacing a webapp ) will be handled
in a different way. You should be able to add/remove workers
without restarting Apache ( and I hope mostly automated ).

Let me know what you think - with patches if possible :-)

Costin

> The setup I use is the following: a load balancer (Alteon) is in front
> of several Apache servers, each hosted on a machine which also hosts a
> Tomcat.
> Let's call those Apache servers A1, A2 and A3 and the associated Tomcat
> servers T1, T2 and T3.
> 
> I have been using Paul's patch, which I modified so the lb_value field of
> fault-tolerant workers would not be changed to a value other than INF.
> 
> The basic setup is that Ai can talk to all Tj, but for requests not
> associated with a session, Ti will be used unless it is unavailable.
> Sessions belonging to Tk will be correctly routed. The load balancing
> worker definition is different for each of the three Ai: the lbfactor is
> set to 0 for the workers connecting to Tk for all k != i and to 1.0 for
> the worker connecting to Ti.
> 
> This setup allows sticky sessions to work independently of the Apache
> server handling the request, which is a good thing since the Alteon cannot
> extract the ';jsessionid=.....' part from the URL in a way which allows
> dispatching the requests to the proper Ai (the cookie is dealt
> with correctly though).
> 
> This works perfectly except when we roll out a new release of our
> webapps. In this case it would be ideal to be able to make the load
> balancer ignore one Apache server, deploy the new version of the webapp
> on this server, and switch this server back on and the other two off so
> the service interruption would be as short as possible for the
> customers. The immediate idea, if Ai/Ti is to be the first server to
> have the new webapp, is to stop Ti so Ai will not be selected by the
> load balancer. This does not work: with Paul's patch Ti is the
> preferred server, BUT if Ti fails then another Tk will be selected by Ai,
> so the load balancer will never declare Ai failed (we did manage to make
> it behave like this by specifying a test URL which includes a jvmroute to
> Ti, but this uses lots of SLB groups on the Alteon) and it will continue
> to send requests to it.
> 
> Bernd's patch allows Ai to reject requests if Ti is stopped; the load
> balancer will therefore quickly declare Ai inactive and will stop sending
> it requests, thus allowing us to roll out the new webapp very easily:
> just set up the new webapp, restart Ti, restart Ai, and as soon as the
> load balancer sees Ai, shut down the other two Ak. The current sessions
> will still be routed to the old webapp, and the new sessions will see the
> new version. When there are no more sessions on the old version, shut
> down Tk (k != i) and deploy the new webapp.
> 
> My remark concerning the possible selection of recovering workers prior
> to the local worker (one with lb_value set to 0) deals with the load
> balancer not being able in this case to declare Ai inactive.
> 
> I hope I have been clear enough and that everybody got the point; if
> not, I'd be glad to explain more thoroughly.
> 
> Mathias.
> 
> Paul Frieden wrote:
> > 
> > Hello,
> > 
> > I'm afraid that I am no longer subscribed to the devel list.  I would be
> > happy to add my advice for this issue, but I don't have time to keep up
> > with the entire devel list.  If there is anything I can do, please just
> > mail me directly.
> > 
> > I chose to use the value 0 for a worker because it used the inverse of
> > the value specified.  The value 0 then resulted in essentially infinite
> > preference.  I used that approach purely because it was the smallest
> > change possible, and the least likely to change the expected behavior
> > for anybody else.  The path of least astonishment and whatnot.  I would
> > be concerned about changing the current behavior now, because people
> > probably want a drop-in replacement.  If there is going to be a change
> > in the algorithm and behavior, a different approach may be better.
> > 
> > I would also like to make a note of how we were using this code.  In our
> > environment, we have an external dedicated load balancer, and three web
> > servers.  The main problem that we ran into was with AOL users.  AOL
> > uses a proxy that randomizes the source IP of requests.  That means that
> > you can no longer count on the source IP to tell the load balancer which
> > server to send future requests to.  We used this code to allow sessions
> > that arrive on the wrong web server to be redirected to the Tomcat on the
> > correct server.  This neatly side-steps the whole issue of changing IPs,
> > because Apache is able to make the decision based on the session ID.
> > 
> > The reliability issue was a nice side effect for us in that it caught a
> > failed server more quickly than the load balancer did, and prevented the
> > user from having a connection time out or seeing an error message.
> > 
> > I hope this provides some insight into why I changed the code that I
> > did, and why that behavior worked well for us.
> > 
> > Paul
> > 
> > [EMAIL PROTECTED] wrote:
> > 
> > >Hi Mathias,
> > >
> > >I think it would be better to discuss this on tomcat-dev.
> > >
> > >The 'error' worker will not be chosen unless the
> > >timeout expires. When the timeout expires, we'll indeed
> > >select it ( in preference to the default ) - this is easy to fix
> > >if it creates problems, but I don't see why it would be a
> > >problem.
> > >
> > >If it is working, the next request will be served normally by
> > >the default. If not, it'll go back to the error state.
> > >
> > >In jk2 I removed that - error workers are no longer
> > >selected. But for jk1 I would rather leave the old
> > >behavior intact.
> > >
> > >Note that the reason for choosing 0 ( in jk2 ) as
> > >the default is that I want to switch from floats to ints;
> > >I'm not convinced floats are good for performance
> > >( or needed ).
> > >
> > >Again - I'm just learning and trying; if you have
> > >any ideas I would be happy to hear them; patches
> > >are more than welcome.
> > >
> > >Costin
> > >
> > >On Sat, 4 May 2002, Mathias Herberts wrote:
> > >
> > >
> > >
> > >>Hi, I just joined the Tomcat-dev list and saw your patch to
> > >>jk_lb_worker.c (making it version 1.9).
> > >>
> > >>If I understand your patch well, it offers the same behavior as Paul's
> > >>patch but with an opposite semantic for an lbfactor of 0.0 in the
> > >>worker's definition, i.e. a value of 0.0 now means ALWAYS USE THIS
> > >>WORKER FOR REQUESTS WITH NO SESSIONS instead of NEVER USE THIS WORKER
> > >>FOR REQUESTS WITH NO SESSIONS. This seems fine to me.
> > >>
> > >>What disturbs me is what happens when one worker is in error
> > >>state and not yet recovering. In get_most_suitable_worker, such a
> > >>worker will be selected whatever its lb_value, meaning a recovering
> > >>worker will have priority over one with an lb_value of 0.0, and this
> > >>seems to break the behavior we had achieved with your patch.
> > >>
> > >>Did I miss something or is this really a problem?
> > >>
> > >>Mathias.
> > >>
> 

