Hi Jason,

Thanks for looking into this.

I created this: https://issues.apache.org/jira/browse/KAFKA-3470
It is my first time doing this, so I am not sure whether I followed all the necessary conventions there.

To your question: no, this should not be a *significant* problem for any client, as there are workarounds (commit less often, increase the session timeout, etc.).
However, since heartbeats and commits are sent over the same socket, intensive commit requests (sync or async) may steal the heartbeats' time slots, especially when network latency is high. Fixing this should improve group stability for all clients.

Regards
-Zaiming

On Sat, Mar 26, 2016 at 12:16 AM, Jason Gustafson <ja...@confluent.io> wrote:

> Hi Zaiming,
>
> Yeah, you're right. Changing coordinator won't cause a rebalance (it hasn't
> been that way since we added group metadata persistence). I went back and
> checked the code and we actually do not reset the heartbeat timer when a
> commit is received. I'm not sure whether there's a good reason for that,
> but nothing is coming to mind. At least when the group is stable, the
> commit could be treated as an implicit heartbeat. Feel free to create a
> JIRA and we can see what others think. Out of curiosity, is this a
> significant problem for the Erlang client you're writing?
>
> -Jason
>
> On Fri, Mar 25, 2016 at 1:38 PM, Zaiming Shi <zmst...@gmail.com> wrote:
>
> > Hi Jason
> >
> > If I understand correctly, when the coordinator is changed the consumer
> > should get a 'NotCoordinatorForGroup' exception, not 'IllegalGenerationId'.
> > Topic metadata change? Like the number of partitions changed?
> > I was testing it in a pretty stable cluster, and it was reproduced several
> > times.
> > I had no such issue if we change the session timeout to 3 minutes.
> > --- does this rule out the topic metadata change?
> >
> > The logs are lost because I was running debug mode in our Erlang client to
> > help debug this issue for my colleague, who's using the new Java client.
> > My colleague has observed very likely the same pattern as I described
> > above.
> > He is trying to get hold of a minimal setup for a reliable reproduction.
> >
> > I will also try to reproduce it in Erlang, and post here a (hopefully
> > sensible) sequence of timestamped heartbeat and commit requests and
> > responses.
> >
> > Will ask more questions if we have new findings.
> >
> > Regards
> > -Zaiming
> >
> > On Fri, Mar 25, 2016 at 5:43 PM, Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Zaiming,
> > >
> > > It rules out the most likely cause of rebalance, but not the only one.
> > > Rebalances can also be caused by a topic metadata change or a coordinator
> > > change. Can you post some logs from the consumer around the time that the
> > > unexpected rebalance occurred?
> > >
> > > -Jason
> > >
> > > On Fri, Mar 25, 2016 at 12:09 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> > >
> > > > Hi Jason
> > > >
> > > > Thanks for the reply!
> > > >
> > > > Forgot to mention that we tried to test the simplest scenario, in which
> > > > there was only one member in the group. I think that should rule out
> > > > group rebalancing, right?
> > > >
> > > > On Thursday, March 24, 2016, Jason Gustafson <ja...@confluent.io> wrote:
> > > >
> > > > > Hi Zaiming,
> > > > >
> > > > > I think the problem is not that commit requests aren't considered as
> > > > > effective as heartbeats (they are), but that you can't rejoin the group
> > > > > using only commits/heartbeats. Every time the group rebalances, all
> > > > > members must rejoin the group by sending a JoinGroup request. Once a
> > > > > rebalance has begun (e.g. because a new consumer has been started), then
> > > > > each member must send the JoinGroup before expiration of the session
> > > > > timeout. If not, then they will be kicked out of the group even if they
> > > > > are still sending heartbeats. Does that make sense?
> > > > >
> > > > > -Jason
> > > > >
> > > > > On Wed, Mar 23, 2016 at 10:03 AM, Zaiming Shi <zmst...@gmail.com> wrote:
> > > > >
> > > > > > Hi there!
> > > > > >
> > > > > > We have noticed that when commit requests are sent intensively, we
> > > > > > receive IllegalGenerationId.
> > > > > > Here are the settings we had the problem with: session-timeout: 30 sec,
> > > > > > heartbeat-rate: 3 sec.
> > > > > > The problem was resolved by increasing the session timeout to 180 sec.
> > > > > >
> > > > > > So I suppose that, for whatever reason (either the client didn't send
> > > > > > a heartbeat, or the broker didn't process the heartbeats in time), the
> > > > > > session was considered dead in the group coordinator.
> > > > > >
> > > > > > My question is: why can't commit requests be taken as an indicator of
> > > > > > the member being alive, and hence keep the session from being killed?
> > > > > >
> > > > > > Regards
> > > > > > -Zaiming
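
For readers following the thread, the behavior under discussion (whether a commit should reset the coordinator's heartbeat timer) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kafka's broker code: `MemberSession`, `commit_counts_as_heartbeat`, and `survives_commit_burst` are hypothetical names, and the 40-second burst of one commit per second with no heartbeat getting through is an invented scenario. Only the 30-second session timeout comes from the original report.

```python
from dataclasses import dataclass


@dataclass
class MemberSession:
    """Hypothetical model of one group member's session on a coordinator."""
    session_timeout: float            # seconds, e.g. 30 as in the original report
    commit_counts_as_heartbeat: bool  # the behavior proposed in this thread
    last_seen: float = 0.0            # time the liveness timer was last reset

    def on_heartbeat(self, now: float) -> None:
        # An explicit heartbeat always resets the liveness timer.
        self.last_seen = now

    def on_commit(self, now: float) -> None:
        # A commit resets the timer only when the proposed behavior is enabled.
        if self.commit_counts_as_heartbeat:
            self.last_seen = now

    def is_alive(self, now: float) -> bool:
        # The session expires once the timer has not been reset for longer
        # than the session timeout.
        return now - self.last_seen <= self.session_timeout


def survives_commit_burst(commit_counts_as_heartbeat: bool) -> bool:
    """Simulate 40 s of commits (one per second) with no heartbeat processed."""
    session = MemberSession(30.0, commit_counts_as_heartbeat)
    session.on_heartbeat(0.0)  # last heartbeat processed at t = 0
    for second in range(1, 41):
        session.on_commit(float(second))
    return session.is_alive(40.0)
```

In this model, calling `survives_commit_burst(False)` lets the session expire even though the member is visibly active, while `survives_commit_burst(True)` keeps it alive, which is the intuition behind treating a commit as an implicit heartbeat when the group is stable.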