The wiki page and the patch LGTM. Just a couple of minor comments:

----
On the wiki page, definition of* replica.lag.time.max.ms
<http://replica.lag.time.max.ms>*: just mention that "If a follower hasn't
sent..." is its original definition, and now we are adding another aspect
in addition to this.

I think it is better to add the server metrics of replica-inactive-time as
"time.millisecond - r.logEndOffsetUpdateTimeMs", and replica-lag-time for
each replica.

----

Guozhang

On Fri, Mar 13, 2015 at 5:25 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

> Thanks for the write-up. Looks good to me. Minor comments:
>
> Under proposed changes:
>
> """The proposal is to calculate replica lag as the amount of time not
> caught up to the leader. A replica is only in ISR if it is caught
> up."""
>
> It may be worth clarifying this on the KIP a bit more - i.e., we
> currently have two forms of replica lag (time-based and message-count
> based). The latter is hard to do correctly across all topics and the
> former is in fact sufficient. i.e., we now have only one concept: "For
> how long can a replica be out-of-sync before it is explicitly removed
> from the ISR?  This is the new interpretation of replica.lag.max.ms. A
> replica that is currently in ISR and happens to fall behind due to a
> burst is given this grace period to catch up to the leader without
> being removed from ISR."
>
> Also, may want to warn if replica.max.messages is provided (doesn't seem
> to be there in the current patch).
>
> Joel
>
> On Thu, Mar 12, 2015 at 09:35:24PM +0000, Aditya Auradkar wrote:
> > I will change the wording to reflect this. But yes, a broker follower
> should only enter the ISR once it is fully caught up.
> >
> > Caught up means that the follower has read from the log end offset from
> the broker. I'm using the log end offset from before the actual read
> operation to avoid these off by one errors. In any case, I plan to run this
> locally with a small cluster and see how it performs.
> >
> > Aditya
> >
> > ________________________________________
> > From: Joe Stein [joe.st...@stealth.ly]
> > Sent: Thursday, March 12, 2015 1:54 PM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP 16 - Replica lag tuning
> >
> > Hi Aditya, thanks for the writeup.
> >
> > Lets say a broker follower goes down. And it is down for an hour or
> two....
> >
> > When the broker follower comes back up it will start sending fetch
> requests
> > (lets say every 2ms which would be under a configured lets say 100ms
> > (whatever)). Then right away the brokers gets added back to the ISR?
> >
> > Maybe it is just the wording or how I am reading it... I think/thought
> that
> > once the replica is caught up THEN the setting goes into action and as
> long
> > as (every 100ms ... whatever) the broker leader is seeing the broker
> > follower as "caught up" then it is in the ISR.
> >
> > Also, what is the definition of "caught up" now without the number of
> > messages? If it is === i worry about that not happening in some networks
> > where it is always off by one or something maybe?
> >
> > ~ Joe Stein
> > - - - - - - - - - - - - - - - - -
> >
> >   http://www.stealth.ly
> > - - - - - - - - - - - - - - - - -
> >
> > On Thu, Mar 12, 2015 at 4:36 PM, Aditya Auradkar <
> > aaurad...@linkedin.com.invalid> wrote:
> >
> > > I wrote a KIP for this after some discussion on KAFKA-1546.
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+16+:+Automated+Replica+Lag+Tuning
> > >
> > > The RB is here: https://reviews.apache.org/r/31967/
> > >
> > > Thanks,
> > > Aditya
> > >
>
>


-- 
-- Guozhang

Reply via email to