Hey Colin, Guozhang, I agree the current state names are not ideal for end users. I tend to see the rebalance as joining the group and receiving the assignment. Maybe the states could be named in those terms? For example: RebalanceJoin and RebalanceAssignment. What do you think?
-Jason On Fri, Jul 28, 2017 at 11:18 AM, Guozhang Wang <wangg...@gmail.com> wrote: > I feel we can change `AwaitSync` to `completeRebalance` while keeping the > other as is. > > cc Jason? > > Guozhang > > On Fri, Jul 28, 2017 at 10:08 AM, Colin McCabe <cmcc...@apache.org> wrote: > >> Thanks for the explanation. I guess maybe we should just keep the group >> names as they are, then? >> >> best, >> Colin >> >> >> On Wed, Jul 26, 2017, at 11:25, Guozhang Wang wrote: >> > To me `PreparingRebalance` sounds better than `StartingRebalance` since >> > only by the end of that stage we have formed a new group. More >> > specifically, this this the workflow from the coordinator's point of >> > view: >> > >> > 1. decided to trigger a rebalance, enter PreparingRebalance phase; >> > | >> > | send out error code for all heartbeat reponses >> > \|/ >> > | >> > | waiting for join group requests from members >> > \|/ >> > 2. formed a new group, increment the generation number, now start >> > rebalancing, entering AwaitSync phase: >> > | >> > | send out the join group responses for whoever >> > requested join >> > \|/ >> > | >> > | waiting for the sync group request from the leader >> > \|/ >> > 3. received assignment from the leader; the rebalance has ended, start >> > ticking for all members, entering Stable phase. >> > | >> > | for whoever else sending the sync group request, >> > reply with the assignment >> > \|/ >> > >> > So from the coordinator's point of view the rebalance starts at >> beginning >> > of step 2 and ends at beginning of step 3. Maybe we can rename >> > `AwaitSync` >> > itself to `CompletingRebalance`. >> > >> > Guozhang >> > >> > >> > >> > On Tue, Jul 25, 2017 at 6:44 AM, Ismael Juma <ism...@juma.me.uk> wrote: >> > >> > > Hi Guozhang, >> > > >> > > Thanks for the clarification. The naming does seem a bit unclear. >> Maybe >> > > `PreparingRebalance` could be `StartingRebalance` or something that >> makes >> > > it clear that it is part of the rebalance instead of a step before the >> > > actual rebalance. `AwaitingSync` could also be `CompletingRebalance`, >> but >> > > not sure if that's better. >> > > >> > > Ismael >> > > >> > > On Mon, Jul 24, 2017 at 7:02 PM, Guozhang Wang <wangg...@gmail.com> >> wrote: >> > > >> > > > Actually Rebalancing includes two steps, and we name them >> > > PrepareRebalance >> > > > and WaitSync (arguably they may not be the best names). But these >> two >> > > steps >> > > > together should be treated as the complete rebalance cycle. >> > > > >> > > > >> > > > Guozhang >> > > > >> > > > On Mon, Jul 24, 2017 at 10:46 AM, Colin McCabe <cmcc...@apache.org> >> > > wrote: >> > > > >> > > > > Hi all, >> > > > > >> > > > > I think maybe it makes sense to rename the "PreparingRebalance" >> > > consumer >> > > > > group state to "Rebalancing." To me, "Preparing" implies that >> there >> > > > > will be a later "rebalance" state that follows-- but there is not. >> > > > > Since we're now exposing this state name publicly in these >> metrics, >> > > > > perhaps it makes sense to do this rename now. Thoughts? >> > > > > >> > > > > best, >> > > > > Colin >> > > > > >> > > > > >> > > > > On Fri, Jul 21, 2017, at 13:52, Colin McCabe wrote: >> > > > > > That's a good point. I revised the KIP to add metrics for all >> the >> > > > group >> > > > > > states. >> > > > > > >> > > > > > best, >> > > > > > Colin >> > > > > > >> > > > > > >> > > > > > On Fri, Jul 21, 2017, at 12:08, Guozhang Wang wrote: >> > > > > > > Ah, that's right Jason. >> > > > > > > >> > > > > > > With that I can be convinced to add one metric per each state. >> > > > > > > >> > > > > > > Guozhang >> > > > > > > >> > > > > > > On Fri, Jul 21, 2017 at 11:44 AM, Jason Gustafson < >> > > > ja...@confluent.io> >> > > > > > > wrote: >> > > > > > > >> > > > > > > > > >> > > > > > > > > "Dead" and "Empty" states are transient: groups usually >> only >> > > > > leaves in >> > > > > > > > this >> > > > > > > > > state for a short while and then being deleted or >> transited to >> > > > > other >> > > > > > > > > states. >> > > > > > > > >> > > > > > > > >> > > > > > > > This is not strictly true for the "Empty" state which we >> also use >> > > > to >> > > > > > > > represent simple groups which only use the coordinator to >> store >> > > > > offsets. I >> > > > > > > > think we may as well cover all the states if we're going to >> cover >> > > > > any of >> > > > > > > > them specifically. >> > > > > > > > >> > > > > > > > -Jason >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > On Fri, Jul 21, 2017 at 9:45 AM, Guozhang Wang < >> > > wangg...@gmail.com >> > > > > >> > > > > wrote: >> > > > > > > > >> > > > > > > > > My two cents: >> > > > > > > > > >> > > > > > > > > "Dead" and "Empty" states are transient: groups usually >> only >> > > > > leaves in >> > > > > > > > this >> > > > > > > > > state for a short while and then being deleted or >> transited to >> > > > > other >> > > > > > > > > states. >> > > > > > > > > >> > > > > > > > > Since we have the existing "*NumGroups*" metric, >> `*NumGroups - >> > > > > > > > > **NumGroupsRebalancing >> > > > > > > > > - **NumGroupsAwaitingSync`* should cover the above three, >> where >> > > > > "Stable" >> > > > > > > > > should be contributing most of the counts: If we have a >> bug >> > > that >> > > > > causes >> > > > > > > > the >> > > > > > > > > num.Dead / Empty to keep increasing, then we would observe >> > > > > `NumGroups` >> > > > > > > > keep >> > > > > > > > > increasing which should be sufficient for alerting. And >> trouble >> > > > > shooting >> > > > > > > > of >> > > > > > > > > the issue could be relying on the log4j. >> > > > > > > > > >> > > > > > > > > *Guozhang* >> > > > > > > > > >> > > > > > > > > On Fri, Jul 21, 2017 at 7:19 AM, Ismael Juma < >> > > ism...@juma.me.uk> >> > > > > wrote: >> > > > > > > > > >> > > > > > > > > > Thanks for the KIP, Colin. This will definitely be >> useful. >> > > One >> > > > > > > > question: >> > > > > > > > > > would it be useful to have a metric for for the number >> of >> > > > groups >> > > > > in >> > > > > > > > each >> > > > > > > > > > possible state? The KIP suggests "PreparingRebalance" >> and >> > > > > > > > "AwaitingSync". >> > > > > > > > > > That leaves "Stable", "Dead" and "Empty". Are those not >> > > useful? >> > > > > > > > > > >> > > > > > > > > > Ismael >> > > > > > > > > > >> > > > > > > > > > On Thu, Jul 20, 2017 at 6:52 PM, Colin McCabe < >> > > > > cmcc...@apache.org> >> > > > > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > Hi all, >> > > > > > > > > > > >> > > > > > > > > > > I posted "KIP-180: Add a broker metric specifying the >> > > number >> > > > of >> > > > > > > > > consumer >> > > > > > > > > > > group rebalances in progress" for discussion: >> > > > > > > > > > > >> > > > > > > > > > > https://cwiki.apache.org/confl >> uence/display/KAFKA/KIP- >> > > > > > > > > > > 180%3A+Add+a+broker+metric+specifying+the+number+of+ >> > > > > > > > > > > consumer+group+rebalances+in+progress >> > > > > > > > > > > >> > > > > > > > > > > Check it out. >> > > > > > > > > > > >> > > > > > > > > > > cheers, >> > > > > > > > > > > Colin >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > -- >> > > > > > > > > -- Guozhang >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > -- >> > > > > > > -- Guozhang >> > > > > >> > > > >> > > > >> > > > >> > > > -- >> > > > -- Guozhang >> > > > >> > > >> > >> > >> > >> > -- >> > -- Guozhang >> > > > > -- > -- Guozhang >