Thanks Jun for the detailed feedback.

Yes, for #1, I mean the live replicas from the ISR.

Actually, I believe we need to make corresponding changes for all 4 of the new
leader election strategies (offline, reassignment, preferred replica, and
controlled shutdown). I will document the details in the KIP.
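
To make that concrete, here is a rough sketch of the selection rule I have in
mind for the LEO-based strategies. The types and names below are hypothetical
(this is not the actual controller code): among the replicas that are both
alive and in the ISR, pick the one with the largest LEO, breaking ties by AR
order.

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Rough sketch only -- hypothetical types/names, not the actual controller code.
final class LeoBasedElectionSketch {

    // Pick the live ISR replica with the largest LEO; ties fall back to AR order,
    // which mirrors today's "first live replica in AR that is in the ISR" rule.
    static Optional<Integer> electLeader(List<Integer> assignedReplicas,   // AR, in assignment order
                                         Set<Integer> isr,                 // current in-sync replicas
                                         Set<Integer> liveBrokers,         // brokers the controller sees as alive
                                         Map<Integer, Long> leoByBroker) { // LEOs gathered from the replicas
        Optional<Integer> best = Optional.empty();
        long bestLeo = -1L;
        for (int replica : assignedReplicas) {
            if (!isr.contains(replica) || !liveBrokers.contains(replica)) {
                continue; // only live ISR members are eligible, same as today
            }
            long leo = leoByBroker.getOrDefault(replica, -1L);
            if (leo > bestLeo) { // strictly greater, so AR order breaks ties
                bestLeo = leo;
                best = Optional.of(replica);
            }
        }
        return best;
    }
}

I have also put a rough sketch of the step (a)-(d) flow from your #4 at the
bottom of this mail, below the quoted thread.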

On Wed, Jan 24, 2018 at 3:59 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Litao,
>
> Thanks for the KIP. Good proposal. A few comments below.
>
> 1. The KIP says "select the live replica with the largest LEO". I guess
> what you meant is selecting the live replica in the ISR with the largest LEO?
>
> 2. I agree that we can probably just reuse the current min.isr
> configuration, but with slightly different semantics. Currently, if
> min.isr is set, a user expects the record to be in at least min.isr
> replicas on successful ack. This KIP guarantees this too. Most people are
> probably surprised that currently the ack is only sent back after all
> replicas in ISR receive the record. This KIP will change the ack to only
> wait on min.isr replicas, which matches the user's expectation and gives
> better latency. Currently, we guarantee no data loss if there are fewer
> than replication factor failures. The KIP changes that to fewer than
> min.isr failures. The latter probably matches the user's expectation.
>
> 3. I agree that the new leader election process is a bit more complicated.
> The controller now needs to contact all replicas in ISR to determine who
> has the longest log. However, this happens infrequently. So, it's probably
> worth doing for the better latency in #2.
>
> 4. We have to think through the preferred leader election process.
> Currently, the first assigned replica is preferred for load balancing.
> There is a process to automatically move the leader to the preferred
> replica when it's in sync. The issue is that the preferred replica may not
> be the replica with the longest log. Naively switching to the preferred
> replica may cause data loss when there are actually fewer failures than
> configured min.isr. One way to address this issue is to do the following
> steps during preferred leader election: (a) controller sends an RPC request
> to the current leader; (b) the current leader stops taking new writes
> (sending a new error code to the clients) and returns its LEO (call it L)
> to the controller; (c) the controller issues an RPC request to the
> preferred replica and waits for its LEO to reach L; (d) the controller changes
> the leader to the preferred replica.
>
> Jun
>
> On Wed, Jan 24, 2018 at 2:51 PM, Litao Deng <litao.d...@airbnb.com.invalid> wrote:
>
> > Sorry folks, just realized I didn't use the correct thread format for the
> > discussion. I started this new one and copied all of the responses from the
> > old one.
> >
> > @Dong
> > It makes sense to just use the min.insync.replicas instead of introducing a
> > new config, and we must make this change together with the LEO-based new
> > leader election.
> >
> > @Xi
> > I thought about embedding the LEO information into the ControllerContext,
> > but didn't find a way. Using RPC will make the leader election period longer,
> > but this should happen in very rare cases (broker failure, controlled
> > shutdown, preferred leader election and partition reassignment).
> >
> > @Jeff
> > The current leader election picks the first replica from AR which exists
> > both in the live broker and ISR sets. I agree with you that changing the
> > current/default behavior would cause a lot of confusion, and that's the
> > reason the title is "Add Support ...". In this case, we wouldn't break any
> > current promises and would provide a separate option for our users.
> > In terms of KIP-250, I feel it is more like the "Semisynchronous
> > Replication" in the MySQL world, and yes it is something between acks=1 and
> > acks=insync.replicas. Additionally, I feel KIP-250 and KIP-227 are two
> > orthogonal improvements. KIP-227 is to improve the replication protocol
> > (like the introduction of parallel replication in MySQL), and KIP-250 is an
> > enhancement for the replication architecture (sync, semi-sync, and async).
> >
> >
> > Dong Lin
> >
> > > Thanks for the KIP. I have one quick comment before you provide more detail
> > > on how to select the leader with the largest LEO.
> > > Do you think it would make sense to change the default behavior of acks=-1,
> > > such that the broker will acknowledge the message once the message has been
> > > replicated to min.insync.replicas brokers? This would allow us to keep the
> > > same durability guarantee and improve produce request latency without
> > > having a new config.
> >
> >
> > Hu Xi
> >
> > > Currently, holding the assigned replicas (AR) for all partitions, the
> > > controller is able to elect new leaders by selecting the first replica of
> > > AR which occurs in both the live replica set and the ISR. If switching to
> > > the LEO-based strategy, the controller context might need to be enriched or
> > > augmented to store those values. If retrieving those LEOs in real time,
> > > several rounds of RPCs are unavoidable, which seems to violate the original
> > > intention of this KIP.
> >
> >
> > Jeff Widman
> >
> > > I agree with Dong, we should see if it's possible to change the default
> > > behavior so that as soon as min.insync.replicas brokers respond, the
> > > broker acknowledges the message back to the client without waiting for
> > > additional brokers who are in the in-sync replica list to respond. (I
> > > actually thought it already worked this way.)
> > > As you implied in the KIP though, changing this default introduces a weird
> > > state where an in-sync follower broker is not guaranteed to have a
> > > message...
> > > So at a minimum, the leadership failover algorithm would need to be sure to
> > > pick the most up-to-date follower... I thought it already did this?
> > > But if multiple brokers fail in quick succession, then a broker that was in
> > > the ISR could become a leader without ever receiving the message...
> > > violating the current promises of unclean.leader.election.enable=False...
> > > so changing the default might not be a tenable solution.
> > > What also jumped out at me in the KIP was the goal of reducing p999 when
> > > setting replica lag time at 10 seconds (!!)... I understand the desire to
> > > minimize frequent ISR shrink/expansion, as I face this same issue at my day
> > > job. But what you're essentially trying to do here is create an additional
> > > replication state that is in-between acks=1 and acks=ISR to paper over a
> > > root problem of ISR shrink/expansion...
> > > I'm just wary of shipping more features (and more operational confusion) if
> > > it's only addressing the symptom rather than the root cause. For example,
> > > my day job's problem is we run a very high number of low-traffic
> > > partitions-per-broker, so the fetch requests hit many partitions before
> > > they fill. Solving that requires changing our architecture + making the
> > > replication protocol more efficient (KIP-227).
> >
> >
> > On Tue, Jan 23, 2018 at 10:02 PM, Litao Deng <litao.d...@airbnb.com> wrote:
> >
> > > Hey folks. I would like to add a feature to support the quorum-based
> > > acknowledgment for the producer request. We have been running a modified
> > > version of Kafka on our testing cluster for weeks, and the improvement of
> > > P999 is significant with very stable latency. Additionally, I have a
> > > proposal to achieve a similar data durability as with the
> > > insync.replicas-based acknowledgment through LEO-based leader election.
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledge
> > >
> >
>
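
Coming back to #4: below is the controller-side flow I read from steps (a)
through (d). Every RPC/method name here is a placeholder (no such API exists
today); the sketch is only meant to illustrate the ordering that avoids the
data loss you describe when switching to the preferred replica.

import java.util.concurrent.TimeUnit;

// Sketch of the preferred-leader switch from steps (a)-(d); all names are placeholders.
final class PreferredLeaderSwitchSketch {

    // Stand-in for the controller-to-broker RPCs described in the steps above.
    interface ControllerChannel {
        long fenceWritesAndGetLeo(int leaderId); // (a)+(b): leader rejects new writes and reports its LEO
        long fetchLeo(int replicaId);            // lets the controller poll the preferred replica's LEO
        void makeLeader(int replicaId);          // (d): controller moves leadership
    }

    static void switchToPreferred(ControllerChannel channel,
                                  int currentLeader,
                                  int preferredReplica,
                                  long timeoutMs) throws InterruptedException {
        // (a) + (b): the current leader stops taking new writes and returns its LEO, call it L.
        long targetLeo = channel.fenceWritesAndGetLeo(currentLeader);

        // (c): wait until the preferred replica's LEO reaches L.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (channel.fetchLeo(preferredReplica) < targetLeo) {
            if (System.nanoTime() - deadline > 0) {
                // If it never catches up, we would have to abort and let the old leader take writes again.
                throw new IllegalStateException("preferred replica did not reach LEO " + targetLeo);
            }
            Thread.sleep(100);
        }

        // (d): only now is it safe to move leadership without losing acknowledged records.
        channel.makeLeader(preferredReplica);
    }
}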
