The wiki page and the patch LGTM. Just a couple of minor comments: ---- On the wiki page, definition of* replica.lag.time.max.ms <http://replica.lag.time.max.ms>*: just mention that "If a follower hasn't sent..." is its original definition, and now we are adding another aspect in addition to this.
I think it is better to add the server metrics of replica-inactive-time as "time.millisecond - r.logEndOffsetUpdateTimeMs", and replica-lag-time for each replica. ---- Guozhang On Fri, Mar 13, 2015 at 5:25 PM, Joel Koshy <jjkosh...@gmail.com> wrote: > Thanks for the write-up. Looks good to me. Minor comments: > > Under proposed changes: > > """The proposal is to calculate replica lag as the amount of time not > caught up to the leader. A replica is only in ISR if it is caught > up.""" > > It may be worth clarifying this on the KIP a bit more - i.e., we > currently have two forms of replica lag (time-based and message-count > based). The latter is hard to do correctly across all topics and the > former is in fact sufficient. i.e., we now have only one concept: "For > how long can a replica be out-of-sync before it is explicitly removed > from the ISR? This is the new interpretation of replica.lag.max.ms. A > replica that is currently in ISR and happens to fall behind due to a > burst is given this grace period to catch up to the leader without > being removed from ISR." > > Also, may want to warn if replica.max.messages is provided (doesn't seem > to be there in the current patch). > > Joel > > On Thu, Mar 12, 2015 at 09:35:24PM +0000, Aditya Auradkar wrote: > > I will change the wording to reflect this. But yes, a broker follower > should only enter the ISR once it is fully caught up. > > > > Caught up means that the follower has read from the log end offset from > the broker. I'm using the log end offset from before the actual read > operation to avoid these off by one errors. In any case, I plan to run this > locally with a small cluster and see how it performs. > > > > Aditya > > > > ________________________________________ > > From: Joe Stein [joe.st...@stealth.ly] > > Sent: Thursday, March 12, 2015 1:54 PM > > To: dev@kafka.apache.org > > Subject: Re: [DISCUSS] KIP 16 - Replica lag tuning > > > > Hi Aditya, thanks for the writeup. > > > > Lets say a broker follower goes down. And it is down for an hour or > two.... > > > > When the broker follower comes back up it will start sending fetch > requests > > (lets say every 2ms which would be under a configured lets say 100ms > > (whatever)). Then right away the brokers gets added back to the ISR? > > > > Maybe it is just the wording or how I am reading it... I think/thought > that > > once the replica is caught up THEN the setting goes into action and as > long > > as (every 100ms ... whatever) the broker leader is seeing the broker > > follower as "caught up" then it is in the ISR. > > > > Also, what is the definition of "caught up" now without the number of > > messages? If it is === i worry about that not happening in some networks > > where it is always off by one or something maybe? > > > > ~ Joe Stein > > - - - - - - - - - - - - - - - - - > > > > http://www.stealth.ly > > - - - - - - - - - - - - - - - - - > > > > On Thu, Mar 12, 2015 at 4:36 PM, Aditya Auradkar < > > aaurad...@linkedin.com.invalid> wrote: > > > > > I wrote a KIP for this after some discussion on KAFKA-1546. > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning > > > > > > The RB is here: https://reviews.apache.org/r/31967/ > > > > > > Thanks, > > > Aditya > > > > > -- -- Guozhang