Hey Onur, I was just watching your talk on rebalancing from last year - https://www.youtube.com/watch?v=QaeXDh12EhE Nice talk!

Based on the talk, I think I have an idea as to why it takes 1 hr in my case.

With 32 boxes / consumers in the same group, I believe the state of the group coordinator's state machine gets disrupted each time a new consumer is added, until the very last consumer has joined. Also, I have the heartbeat set to 97 seconds (97 secs because normal processing can take that long and we don't want the coordinator to think the consumer is dead). I think these two things together are why the cluster restart takes > 1 hr.

I'm curious how LinkedIn does clean cluster restarts. How do you handle the scenario described above?

Praveen

On Wed, Feb 15, 2017 at 10:22 AM, Praveen <praveev...@gmail.com> wrote:

> I still think a clean cluster start should not take > 1 hr for balancing
> though. Is this expected or am i doing something different?
>
> I thought this would be a common use case.
>
> Praveen
>
> On Fri, Feb 10, 2017 at 10:26 AM, Onur Karaman <okara...@linkedin.com.invalid> wrote:
>
>> Pradeep is right.
>>
>> close() will try and send out a LeaveGroupRequest while a kill -9 will not.
>>
>> On Fri, Feb 10, 2017 at 10:19 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>
>> > I believe if you're calling the .close() method on shutdown, then the
>> > LeaveGroupRequest will be made. If you're doing a kill -9, I'm not sure if
>> > that request will be made.
>> >
>> > On Fri, Feb 10, 2017 at 8:47 AM, Praveen <praveev...@gmail.com> wrote:
>> >
>> > > @Pradeep - I just read your thread, the 1hr pause was when all the
>> > > consumers where shutdown simultaneously. I'm testing out rolling restart
>> > > to get the actual numbers. The initial numbers are promising.
>> > >
>> > > `STOP (1) (1min later kicks off) -> REBALANCE -> START (1) -> REBALANCE
>> > > (takes 1min to get a partition)`
>> > >
>> > > In your thread, Ewen says -
>> > >
>> > > "The LeaveGroupRequest is only sent on a graceful shutdown.
>> > > If a consumer knows it is going to
>> > > shutdown, it is good to proactively make sure the group knows it needs to
>> > > rebalance work because some of the partitions that were handled by the
>> > > consumer need to be handled by some other group members."
>> > >
>> > > So does this mean that the consumer should inform the group ahead of
>> > > time before it goes down? Currently, I just shutdown the process.
>> > >
>> > >
>> > > On Fri, Feb 10, 2017 at 8:35 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>> > >
>> > > > I asked a similar question a while ago. There doesn't appear to be a way to
>> > > > not triggering the rebalance. But I'm not sure why it would be taking > 1hr
>> > > > in your case. For us it was pretty fast.
>> > > >
>> > > > https://www.mail-archive.com/users@kafka.apache.org/msg23925.html
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Feb 10, 2017 at 4:28 AM, Krzysztof Lesniewski, Nexiot AG <krzysztof.lesniew...@nexiot.ch> wrote:
>> > > >
>> > > > > Would be great to get some input on it.
>> > > > >
>> > > > > - Krzysztof Lesniewski
>> > > > >
>> > > > >
>> > > > > On 06.02.2017 08:27, Praveen wrote:
>> > > > >
>> > > > >> I have a 16 broker kafka cluster. There is a topic with 32 partitions
>> > > > >> containing real time data and on the other side, I have 32 boxes w/ 1
>> > > > >> consumer reading from these partitions.
>> > > > >>
>> > > > >> Today our deployment strategy is stop, deploy and start the processes on
>> > > > >> all the 32 consumers. This triggers re-balancing and takes a long period of
>> > > > >> time (> 1hr). Such a long pause isn't good for real time processing.
>> > > > >>
>> > > > >> I was thinking of rolling deploy but I think that will still cause
>> > > > >> re-balancing b/c we will still have consumers go down and come up.
>> > > > >>
>> > > > >> How do you deploy to consumers without triggering re-balancing (or
>> > > > >> triggering one that doesn't affect your SLA) when doing real time
>> > > > >> processing?
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Praveen
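
On the close() vs. kill -9 point quoted above: the replies say that close() tries to send a LeaveGroupRequest and kill -9 does not. A minimal sketch of that shutdown pattern follows. It is written in Python against a stand-in class, so `StubConsumer`, `run_until_stopped` and `install_sigterm_handler` are hypothetical names for illustration and not Kafka client API. With the real consumer, the same structure would presumably apply, with close() called in the same place.

```python
import signal
import threading


class StubConsumer:
    """Stand-in for a Kafka consumer (hypothetical, not a real client API)."""

    def __init__(self):
        self.left_group = False
        self.polls = 0

    def poll(self, timeout_s):
        # A real consumer would fetch records here; the stub returns none.
        self.polls += 1
        return []

    def close(self):
        # Per the thread, close() is where the real client tries to send
        # the LeaveGroupRequest. The stub only records that it was called.
        self.left_group = True


def run_until_stopped(consumer, stop_event, poll_timeout_s=0.01):
    """Poll until stop_event is set, then always close the consumer."""
    try:
        while not stop_event.is_set():
            for _record in consumer.poll(poll_timeout_s):
                pass  # process the record here
            # The stub's poll() returns immediately, so wait briefly
            # instead of spinning.
            stop_event.wait(poll_timeout_s)
    finally:
        # Runs on a normal stop and on exceptions, but never on SIGKILL.
        consumer.close()


def install_sigterm_handler(stop_event):
    """Turn SIGTERM (plain `kill`) into a clean stop request.

    SIGKILL (`kill -9`) cannot be caught, so close() is never reached
    in that case. Must be called from the main thread.
    """
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_event.set())
```

In a real process one would create a `threading.Event`, call `install_sigterm_handler(stop_event)` from the main thread, and then call `run_until_stopped(consumer, stop_event)`. The deploy script would then stop the process with a plain `kill` rather than `kill -9`.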