Sounds good to me. On Wed, Aug 9, 2017 at 11:32 PM, Colin McCabe <cmcc...@apache.org> wrote:
> Hi Apurva, > > Thanks for taking another look. Responses below: > > On Mon, Aug 7, 2017, at 17:47, Apurva Mehta wrote: > > The KIP looks good to me. In your latest proposal, the change of state > > would be captured as followed in the metrics for groups using Kafka for > > membership management: > > > > PreparingRebalance -> CompletingRebalance -> Stable -> Dead? > > Right. As described in > core/src/main/scala/kafka/coordinator/group/GroupMetadata.scala > > > > > If a group is just being used to store offsets, then it is always Empty? > > I believe so. Groups can also be Empty if they have no more members, > but the offsets have not yet expired. > > Of course, this KIP is just exposing the metrics, not changing how > groups work in any way. > > Maybe we should start a vote, if this looks good to everyone? > > best, > Colin > > > > > > If so, this makes sense to me. > > > > Thanks, > > Apurva > > > > On Mon, Aug 7, 2017 at 5:09 PM, Colin McCabe <co...@cmccabe.xyz> wrote: > > > > > How about PreparingRebalance / CompletingRebalance? > > > > > > cheers, > > > Colin > > > > > > > > > On Fri, Aug 4, 2017, at 09:03, Ismael Juma wrote: > > > > I agree that we should make them consistent. I think RebalanceJoin > and > > > > RebalanceAssignment are reasonable names. I think they are a bit more > > > > descriptive than `PreparingRebalance` and `CompletingRebalance`. If > we > > > > need > > > > to add more states, it seems a little easier to do if the states are > a > > > > bit > > > > more descriptive. I am OK with either of the 2 options as I think > they > > > > are > > > > both better than the status quo. > > > > > > > > Ismael > > > > > > > > On Fri, Aug 4, 2017 at 4:52 PM, Jason Gustafson <ja...@confluent.io> > > > > wrote: > > > > > > > > > Hey Guozhang, > > > > > > > > > > Usually I think such naming inconsistencies are best avoided. It > adds > > > > > another level of confusion for people who have to dip into the > code, > > > figure > > > > > out a problem, and ultimately explain it. Since we already have the > > > > > PreparingRebalance state, maybe we could just rename the > AwaitingSync > > > state > > > > > to CompletingRebalance? > > > > > > > > > > -Jason > > > > > > > > > > On Thu, Aug 3, 2017 at 6:09 PM, Guozhang Wang <wangg...@gmail.com> > > > wrote: > > > > > > > > > > > From an ops person's view who are mostly likely watching the > metrics > > > > > these > > > > > > names may not be very clear as people may not know the internals > > > well. > > > > > I'd > > > > > > prefer PrepareRebalance and CompleteRebalance since they may be > > > easier to > > > > > > understand thought not 100 percent accurately match to internal > > > > > > implementation. > > > > > > > > > > > > > > > > > > Guozhang > > > > > > > > > > > > > > > > > > On Thu, Aug 3, 2017 at 4:14 PM, Jason Gustafson < > ja...@confluent.io> > > > > > > wrote: > > > > > > > > > > > > > Hey Colin, Guozhang, > > > > > > > > > > > > > > I agree the current state names are not ideal for end users. I > > > tend to > > > > > > see > > > > > > > the rebalance as joining the group and receiving the > assignment. > > > Maybe > > > > > > the > > > > > > > states could be named in those terms? For example: > RebalanceJoin > > > and > > > > > > > RebalanceAssignment. What do you think? > > > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > On Fri, Jul 28, 2017 at 11:18 AM, Guozhang Wang < > > > wangg...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > I feel we can change `AwaitSync` to `completeRebalance` while > > > keeping > > > > > > the > > > > > > > > other as is. > > > > > > > > > > > > > > > > cc Jason? > > > > > > > > > > > > > > > > Guozhang > > > > > > > > > > > > > > > > On Fri, Jul 28, 2017 at 10:08 AM, Colin McCabe < > > > cmcc...@apache.org> > > > > > > > wrote: > > > > > > > > > > > > > > > >> Thanks for the explanation. I guess maybe we should just > keep > > > the > > > > > > group > > > > > > > >> names as they are, then? > > > > > > > >> > > > > > > > >> best, > > > > > > > >> Colin > > > > > > > >> > > > > > > > >> > > > > > > > >> On Wed, Jul 26, 2017, at 11:25, Guozhang Wang wrote: > > > > > > > >> > To me `PreparingRebalance` sounds better than > > > `StartingRebalance` > > > > > > > since > > > > > > > >> > only by the end of that stage we have formed a new group. > More > > > > > > > >> > specifically, this this the workflow from the > coordinator's > > > point > > > > > of > > > > > > > >> > view: > > > > > > > >> > > > > > > > > >> > 1. decided to trigger a rebalance, enter > PreparingRebalance > > > phase; > > > > > > > >> > | > > > > > > > >> > | send out error code for all > heartbeat > > > > > reponses > > > > > > > >> > \|/ > > > > > > > >> > | > > > > > > > >> > | waiting for join group requests from > > > members > > > > > > > >> > \|/ > > > > > > > >> > 2. formed a new group, increment the generation number, > now > > > start > > > > > > > >> > rebalancing, entering AwaitSync phase: > > > > > > > >> > | > > > > > > > >> > | send out the join group responses > for > > > > > whoever > > > > > > > >> > requested join > > > > > > > >> > \|/ > > > > > > > >> > | > > > > > > > >> > | waiting for the sync group request > from > > > the > > > > > > > leader > > > > > > > >> > \|/ > > > > > > > >> > 3. received assignment from the leader; the rebalance has > > > ended, > > > > > > start > > > > > > > >> > ticking for all members, entering Stable phase. > > > > > > > >> > | > > > > > > > >> > | for whoever else sending the sync > group > > > > > > request, > > > > > > > >> > reply with the assignment > > > > > > > >> > \|/ > > > > > > > >> > > > > > > > > >> > So from the coordinator's point of view the rebalance > starts > > > at > > > > > > > >> beginning > > > > > > > >> > of step 2 and ends at beginning of step 3. Maybe we can > rename > > > > > > > >> > `AwaitSync` > > > > > > > >> > itself to `CompletingRebalance`. > > > > > > > >> > > > > > > > > >> > Guozhang > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > On Tue, Jul 25, 2017 at 6:44 AM, Ismael Juma < > > > ism...@juma.me.uk> > > > > > > > wrote: > > > > > > > >> > > > > > > > > >> > > Hi Guozhang, > > > > > > > >> > > > > > > > > > >> > > Thanks for the clarification. The naming does seem a bit > > > > > unclear. > > > > > > > >> Maybe > > > > > > > >> > > `PreparingRebalance` could be `StartingRebalance` or > > > something > > > > > > that > > > > > > > >> makes > > > > > > > >> > > it clear that it is part of the rebalance instead of a > step > > > > > before > > > > > > > the > > > > > > > >> > > actual rebalance. `AwaitingSync` could also be > > > > > > > `CompletingRebalance`, > > > > > > > >> but > > > > > > > >> > > not sure if that's better. > > > > > > > >> > > > > > > > > > >> > > Ismael > > > > > > > >> > > > > > > > > > >> > > On Mon, Jul 24, 2017 at 7:02 PM, Guozhang Wang < > > > > > > wangg...@gmail.com> > > > > > > > >> wrote: > > > > > > > >> > > > > > > > > > >> > > > Actually Rebalancing includes two steps, and we name > them > > > > > > > >> > > PrepareRebalance > > > > > > > >> > > > and WaitSync (arguably they may not be the best > names). > > > But > > > > > > these > > > > > > > >> two > > > > > > > >> > > steps > > > > > > > >> > > > together should be treated as the complete rebalance > > > cycle. > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > Guozhang > > > > > > > >> > > > > > > > > > > >> > > > On Mon, Jul 24, 2017 at 10:46 AM, Colin McCabe < > > > > > > > cmcc...@apache.org> > > > > > > > >> > > wrote: > > > > > > > >> > > > > > > > > > > >> > > > > Hi all, > > > > > > > >> > > > > > > > > > > > >> > > > > I think maybe it makes sense to rename the > > > > > > "PreparingRebalance" > > > > > > > >> > > consumer > > > > > > > >> > > > > group state to "Rebalancing." To me, "Preparing" > > > implies > > > > > that > > > > > > > >> there > > > > > > > >> > > > > will be a later "rebalance" state that follows-- but > > > there > > > > > is > > > > > > > not. > > > > > > > >> > > > > Since we're now exposing this state name publicly in > > > these > > > > > > > >> metrics, > > > > > > > >> > > > > perhaps it makes sense to do this rename now. > Thoughts? > > > > > > > >> > > > > > > > > > > > >> > > > > best, > > > > > > > >> > > > > Colin > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > On Fri, Jul 21, 2017, at 13:52, Colin McCabe wrote: > > > > > > > >> > > > > > That's a good point. I revised the KIP to add > > > metrics for > > > > > > all > > > > > > > >> the > > > > > > > >> > > > group > > > > > > > >> > > > > > states. > > > > > > > >> > > > > > > > > > > > > >> > > > > > best, > > > > > > > >> > > > > > Colin > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > On Fri, Jul 21, 2017, at 12:08, Guozhang Wang > wrote: > > > > > > > >> > > > > > > Ah, that's right Jason. > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > With that I can be convinced to add one metric > per > > > each > > > > > > > state. > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > Guozhang > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > On Fri, Jul 21, 2017 at 11:44 AM, Jason > Gustafson < > > > > > > > >> > > > ja...@confluent.io> > > > > > > > >> > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > "Dead" and "Empty" states are transient: > groups > > > > > > usually > > > > > > > >> only > > > > > > > >> > > > > leaves in > > > > > > > >> > > > > > > > this > > > > > > > >> > > > > > > > > state for a short while and then being > deleted > > > or > > > > > > > >> transited to > > > > > > > >> > > > > other > > > > > > > >> > > > > > > > > states. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > This is not strictly true for the "Empty" > state > > > which > > > > > we > > > > > > > >> also use > > > > > > > >> > > > to > > > > > > > >> > > > > > > > represent simple groups which only use the > > > coordinator > > > > > > to > > > > > > > >> store > > > > > > > >> > > > > offsets. I > > > > > > > >> > > > > > > > think we may as well cover all the states if > we're > > > > > going > > > > > > > to > > > > > > > >> cover > > > > > > > >> > > > > any of > > > > > > > >> > > > > > > > them specifically. > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > -Jason > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > On Fri, Jul 21, 2017 at 9:45 AM, Guozhang > Wang < > > > > > > > >> > > wangg...@gmail.com > > > > > > > >> > > > > > > > > > > > >> > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > My two cents: > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > "Dead" and "Empty" states are transient: > groups > > > > > > usually > > > > > > > >> only > > > > > > > >> > > > > leaves in > > > > > > > >> > > > > > > > this > > > > > > > >> > > > > > > > > state for a short while and then being > deleted > > > or > > > > > > > >> transited to > > > > > > > >> > > > > other > > > > > > > >> > > > > > > > > states. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > Since we have the existing "*NumGroups*" > metric, > > > > > > > >> `*NumGroups - > > > > > > > >> > > > > > > > > **NumGroupsRebalancing > > > > > > > >> > > > > > > > > - **NumGroupsAwaitingSync`* should cover the > > > above > > > > > > > three, > > > > > > > >> where > > > > > > > >> > > > > "Stable" > > > > > > > >> > > > > > > > > should be contributing most of the counts: > If we > > > > > have > > > > > > a > > > > > > > >> bug > > > > > > > >> > > that > > > > > > > >> > > > > causes > > > > > > > >> > > > > > > > the > > > > > > > >> > > > > > > > > num.Dead / Empty to keep increasing, then we > > > would > > > > > > > observe > > > > > > > >> > > > > `NumGroups` > > > > > > > >> > > > > > > > keep > > > > > > > >> > > > > > > > > increasing which should be sufficient for > > > alerting. > > > > > > And > > > > > > > >> trouble > > > > > > > >> > > > > shooting > > > > > > > >> > > > > > > > of > > > > > > > >> > > > > > > > > the issue could be relying on the log4j. > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > *Guozhang* > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > On Fri, Jul 21, 2017 at 7:19 AM, Ismael > Juma < > > > > > > > >> > > ism...@juma.me.uk> > > > > > > > >> > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > Thanks for the KIP, Colin. This will > > > definitely be > > > > > > > >> useful. > > > > > > > >> > > One > > > > > > > >> > > > > > > > question: > > > > > > > >> > > > > > > > > > would it be useful to have a metric for > for > > > the > > > > > > number > > > > > > > >> of > > > > > > > >> > > > groups > > > > > > > >> > > > > in > > > > > > > >> > > > > > > > each > > > > > > > >> > > > > > > > > > possible state? The KIP suggests > > > > > > "PreparingRebalance" > > > > > > > >> and > > > > > > > >> > > > > > > > "AwaitingSync". > > > > > > > >> > > > > > > > > > That leaves "Stable", "Dead" and "Empty". > Are > > > > > those > > > > > > > not > > > > > > > >> > > useful? > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > Ismael > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > On Thu, Jul 20, 2017 at 6:52 PM, Colin > McCabe > > > < > > > > > > > >> > > > > cmcc...@apache.org> > > > > > > > >> > > > > > > > > wrote: > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > Hi all, > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > I posted "KIP-180: Add a broker metric > > > > > specifying > > > > > > > the > > > > > > > >> > > number > > > > > > > >> > > > of > > > > > > > >> > > > > > > > > consumer > > > > > > > >> > > > > > > > > > > group rebalances in progress" for > > > discussion: > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > https://cwiki.apache.org/confl > > > > > > > >> uence/display/KAFKA/KIP- > > > > > > > >> > > > > > > > > > > 180%3A+Add+a+broker+metric+ > > > > > > > specifying+the+number+of+ > > > > > > > >> > > > > > > > > > > consumer+group+rebalances+in+progress > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > Check it out. > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > cheers, > > > > > > > >> > > > > > > > > > > Colin > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > -- > > > > > > > >> > > > > > > > > -- Guozhang > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > -- > > > > > > > >> > > > > > > -- Guozhang > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > -- > > > > > > > >> > > > -- Guozhang > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > -- > > > > > > > >> > -- Guozhang > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > -- Guozhang > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > -- Guozhang > > > > > > > > > > > > > > >