Thanks, Jun, for the detailed feedback. Yes, for #1, I mean the live replicas from the ISR.
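To make that selection rule concrete, here is a minimal illustrative Java sketch; ReplicaState and the surrounding names are placeholders rather than actual controller code, and it assumes the controller already has the LEOs of the live ISR replicas in hand.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class LeoBasedElection {

    // Hypothetical snapshot of a replica as seen by the controller at election time.
    public record ReplicaState(int brokerId, long logEndOffset, boolean isLive, boolean inIsr) {}

    // Pick the live ISR replica with the largest LEO; empty if no replica qualifies.
    public static Optional<Integer> electLeader(List<ReplicaState> replicas) {
        return replicas.stream()
                .filter(r -> r.isLive() && r.inIsr())
                .max(Comparator.comparingLong(ReplicaState::logEndOffset))
                .map(ReplicaState::brokerId);
    }
}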
Actually, I believe all 4 of the new leader election strategies (offline, reassignment, preferred replica and controlled shutdown) need corresponding changes. I will document the details in the KIP.

On Wed, Jan 24, 2018 at 3:59 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Litao,
>
> Thanks for the KIP. Good proposal. A few comments below.
>
> 1. The KIP says "select the live replica with the largest LEO". I guess what you meant is selecting, among the live replicas in the ISR, the one with the largest LEO?
>
> 2. I agree that we can probably just reuse the current min.isr configuration, but with slightly different semantics. Currently, if min.isr is set, a user expects the record to be in at least min.isr replicas on a successful ack. This KIP guarantees that too. Most people are probably surprised that today the ack is only sent back after all replicas in the ISR receive the record. This KIP changes the ack to wait on only min.isr replicas, which matches the user's expectation and gives better latency. Currently, we guarantee no data loss as long as there are fewer failures than the replication factor. The KIP changes that to fewer than min.isr failures. The latter probably matches the user's expectation.
>
> 3. I agree that the new leader election process is a bit more complicated. The controller now needs to contact all replicas in the ISR to determine who has the longest log. However, this happens infrequently, so it's probably worth doing for the better latency in #2.
>
> 4. We have to think through the preferred leader election process. Currently, the first assigned replica is preferred for load balancing, and there is a process to automatically move the leader to the preferred replica when it is in sync. The issue is that the preferred replica may not be the replica with the longest log, so naively switching to the preferred replica may cause data loss even when there are fewer failures than the configured min.isr. One way to address this is to do the following during preferred leader election: (a) the controller sends an RPC request to the current leader; (b) the current leader stops taking new writes (returning a new error code to clients) and returns its LEO (call it L) to the controller; (c) the controller issues an RPC request to the preferred replica and waits for its LEO to reach L; (d) the controller changes the leader to the preferred replica.
>
> Jun
>
> On Wed, Jan 24, 2018 at 2:51 PM, Litao Deng <litao.d...@airbnb.com.invalid> wrote:
>
> > Sorry folks, I just realized I didn't use the correct thread format for the discussion, so I started this new one and copied all of the responses from the old one.
> >
> > @Dong
> > It makes sense to just reuse min.insync.replicas instead of introducing a new config, and we must make this change together with the LEO-based new leader election.
> >
> > @Xi
> > I thought about embedding the LEO information in the ControllerContext but didn't find a way. Using RPC will make the leader election period longer, but this should happen only in very rare cases (broker failure, controlled shutdown, preferred leader election and partition reassignment).
> >
> > @Jeff
> > The current leader election picks the first replica from the AR that is in both the live broker set and the ISR. I agree with you that changing the current/default behavior would cause a lot of confusion, and that's the reason the title is "Add Support ...". This way we wouldn't break any current promises and would provide a separate option for our users. In terms of KIP-250, I feel it is more like "semisynchronous replication" in the MySQL world, and yes, it is something between acks=1 and acks=insync.replicas. Additionally, I feel KIP-250 and KIP-227 are two orthogonal improvements: KIP-227 improves the replication protocol (like the introduction of parallel replication in MySQL), and KIP-250 is an enhancement to the replication architecture (sync, semi-sync, and async).
> >
> > Dong Lin
> >
> > > Thanks for the KIP. I have one quick comment before you provide more detail on how to select the leader with the largest LEO.
> > > Do you think it would make sense to change the default behavior of acks=-1, such that the broker acknowledges the message once it has been replicated to min.insync.replicas brokers? This would allow us to keep the same durability guarantee and improve produce request latency without adding a new config.
> >
> > Hu Xi
> >
> > > Currently, by holding the assigned replicas (AR) for all partitions, the controller is able to elect new leaders by selecting the first replica of the AR that occurs in both the live replica set and the ISR. If switching to the LEO-based strategy, the controller context might need to be enriched or augmented to store those values. If the LEOs are retrieved in real time instead, several rounds of RPCs are unavoidable, which seems to defeat the original intention of this KIP.
> >
> > Jeff Widman
> >
> > > I agree with Dong; we should see if it's possible to change the default behavior so that as soon as min.insync.replicas brokers respond, the broker acknowledges the message back to the client without waiting for the additional brokers in the in-sync replica list to respond. (I actually thought it already worked this way.)
> > > As you implied in the KIP though, changing this default introduces a weird state where an in-sync follower broker is not guaranteed to have a message... So at a minimum, the leadership failover algorithm would need to be sure to pick the most up-to-date follower... I thought it already did this?
> > > But if multiple brokers fail in quick succession, then a broker that was in the ISR could become leader without ever receiving the message... violating the current promises of unclean.leader.election.enable=false... so changing the default might not be a tenable solution.
> > > What also jumped out at me in the KIP was the goal of reducing p999 while setting the replica lag time at 10 seconds (!!)... I understand the desire to minimize frequent ISR shrink/expansion, as I face this same issue at my day job. But what you're essentially trying to do here is create an additional replication state in between acks=1 and acks=ISR to paper over a root problem of ISR shrink/expansion...
> > > I'm just wary of shipping more features (and more operational confusion) if they only address the symptom rather than the root cause. For example, my day job's problem is that we run a very high number of low-traffic partitions per broker, so the fetch requests hit many partitions before they fill. Solving that requires changing our architecture and making the replication protocol more efficient (KIP-227).
> >
> > On Tue, Jan 23, 2018 at 10:02 PM, Litao Deng <litao.d...@airbnb.com> wrote:
> >
> > > Hey folks. I would like to add a feature to support quorum-based acknowledgment for the producer request. We have been running a modified version of Kafka on our testing cluster for weeks, and the improvement in P999 is significant, with very stable latency. Additionally, I have a proposal to achieve similar data durability as with the insync.replicas-based acknowledgment, through LEO-based leader election.
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-250+Add+Support+for+Quorum-based+Producer+Acknowledge
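To make Jun's steps (a)-(d) for preferred leader election concrete, here is a rough sketch of the sequencing. It is only an illustration; ControllerChannel and all of its method names are hypothetical placeholders rather than existing Kafka RPCs, and a real implementation would need timeouts and failure handling.

import java.util.concurrent.TimeUnit;

public class PreferredLeaderSwitch {

    // Placeholder for the controller-to-broker RPCs described in steps (a)-(d).
    public interface ControllerChannel {
        long freezeAndGetLeo(int brokerId, String topicPartition); // (a)+(b): leader stops new writes, returns its LEO
        long fetchLeo(int brokerId, String topicPartition);        // used while waiting in (c)
        void makeLeader(int brokerId, String topicPartition);      // (d): switch leadership to this broker
    }

    public static void switchToPreferred(ControllerChannel channel,
                                         String topicPartition,
                                         int currentLeader,
                                         int preferredReplica) throws InterruptedException {
        // (a)+(b): ask the current leader to stop accepting new writes and report its LEO (call it L).
        long targetLeo = channel.freezeAndGetLeo(currentLeader, topicPartition);

        // (c): wait until the preferred replica's LEO has reached L.
        while (channel.fetchLeo(preferredReplica, topicPartition) < targetLeo) {
            TimeUnit.MILLISECONDS.sleep(100);
        }

        // (d): move leadership to the preferred replica; the old leader resumes as a follower.
        channel.makeLeader(preferredReplica, topicPartition);
    }
}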