Hey Onur,

I was just watching your talk on rebalancing from last year -
https://www.youtube.com/watch?v=QaeXDh12EhE
Nice talk!

I think I have an idea as to why it takes 1 hr in my case, based on the
talk in the video. With 32 boxes / consumers in the same group, I believe
the group coordinator's state machine gets reset each time a new consumer
joins, until the very last consumer has joined. Also I have the heartbeat
set to 97 seconds (97 secs b/c normal processing could take that long and
we don't want the coordinator to think the consumer is dead). I think these
two things coupled together are why the cluster restart takes > 1hr. I'm
curious how LinkedIn does clean cluster restarts? How do you handle the
scenario described above?
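
On the close() vs. kill -9 point quoted further down, a minimal Java sketch
of a shutdown hook that closes the consumer might look like the following.
StubConsumer and GracefulShutdownSketch are hypothetical names made up for
illustration; the stub stands in for the real consumer so the snippet runs
without the Kafka client on the classpath. Only the close() /
LeaveGroupRequest behaviour comes from this thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the real consumer, so the sketch runs without
// the Kafka client. With the real client, close() is the call that tries
// to send the LeaveGroupRequest.
class StubConsumer implements AutoCloseable {
    final AtomicBoolean closed = new AtomicBoolean(false);

    @Override
    public void close() {
        closed.set(true);
    }
}

class GracefulShutdownSketch {
    // Builds (but does not register) a hook thread that closes the consumer.
    static Thread closeOnShutdown(AutoCloseable consumer) {
        return new Thread(() -> {
            try {
                consumer.close();
            } catch (Exception e) {
                System.err.println("close() failed during shutdown: " + e);
            }
        }, "consumer-shutdown-hook");
    }

    public static void main(String[] args) {
        StubConsumer consumer = new StubConsumer();
        // The JVM runs this hook on normal exit and on SIGTERM (plain `kill`),
        // but never on `kill -9`.
        Runtime.getRuntime().addShutdownHook(closeOnShutdown(consumer));
        System.out.println("hook registered; a normal exit will call close()");
    }
}
```

A JVM shutdown hook runs on normal exit and on SIGTERM but not on kill -9,
which matches the behaviour described in the thread. With the real Java
client the consumer is not safe to use from several threads, so in practice
the hook would more likely call wakeup() and let the polling thread call
close(); that part is not shown here.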

Praveen


On Wed, Feb 15, 2017 at 10:22 AM, Praveen <praveev...@gmail.com> wrote:

> I still think a clean cluster start should not take > 1 hr for balancing
> though. Is this expected or am I doing something different?
>
> I thought this would be a common use case.
>
> Praveen
>
> On Fri, Feb 10, 2017 at 10:26 AM, Onur Karaman <
> okara...@linkedin.com.invalid> wrote:
>
>> Pradeep is right.
>>
>> close() will try and send out a LeaveGroupRequest while a kill -9 will
>> not.
>>
>> On Fri, Feb 10, 2017 at 10:19 AM, Pradeep Gollakota
>> <pradeep...@gmail.com> wrote:
>>
>> > I believe if you're calling the .close() method on shutdown, then the
>> > LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure
>> > if that request will be made.
>> >
>> > On Fri, Feb 10, 2017 at 8:47 AM, Praveen <praveev...@gmail.com> wrote:
>> >
>> > > @Pradeep - I just read your thread, the 1hr pause was when all the
>> > > consumers were shut down simultaneously. I'm testing out a rolling
>> > > restart to get the actual numbers. The initial numbers are promising.
>> > >
>> > > `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) ->
>> > > REBALANCE (takes 1min to get a partition)`
>> > >
>> > > In your thread, Ewen says -
>> > >
>> > > "The LeaveGroupRequest is only sent on a graceful shutdown. If a
>> > > consumer knows it is going to shutdown, it is good to proactively
>> > > make sure the group knows it needs to rebalance work because some
>> > > of the partitions that were handled by the consumer need to be
>> > > handled by some other group members."
>> > >
>> > > So does this mean that the consumer should inform the group ahead of
>> > > time before it goes down? Currently, I just shut down the process.
>> > >
>> > >
>> > > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota
>> > > <pradeep...@gmail.com> wrote:
>> > >
>> > > > I asked a similar question a while ago. There doesn't appear to be a
>> > > > way to avoid triggering the rebalance. But I'm not sure why it would
>> > > > be taking > 1hr in your case. For us it was pretty fast.
>> > > >
>> > > > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <
>> > > > krzysztof.lesniew...@nexiot.ch> wrote:
>> > > >
>> > > > > Would be great to get some input on it.
>> > > > >
>> > > > > - Krzysztof Lesniewski
>> > > > >
>> > > > >
>> > > > > On 06.02.2017 08:27, Praveen wrote:
>> > > > >
>> > > > >> I have a 16-broker Kafka cluster. There is a topic with 32
>> > > > >> partitions containing real time data and on the other side, I have
>> > > > >> 32 boxes w/ 1 consumer each reading from these partitions.
>> > > > >>
>> > > > >> Today our deployment strategy is stop, deploy and start the
>> > > > >> processes on all the 32 consumers. This triggers re-balancing and
>> > > > >> takes a long period of time (> 1hr). Such a long pause isn't good
>> > > > >> for real time processing.
>> > > > >>
>> > > > >> I was thinking of rolling deploy but I think that will still
>> > > > >> cause re-balancing b/c we will still have consumers go down and
>> > > > >> come up.
>> > > > >>
>> > > > >> How do you deploy to consumers without triggering re-balancing
>> > > > >> (or triggering one that doesn't affect your SLA) when doing real
>> > > > >> time processing?
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Praveen
>> > > > >>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
