Hi Jack,

Thanks for the comment. I have updated the reassignment part. Now the reassignment can only be completed/canceled if the final ISR size is at least the min ISR. Thanks for your efforts on the TLA+ specification! It has been a great help to the KIP!
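To make the new condition concrete, here is a minimal sketch of the check. The class and method names are hypothetical, not the actual PartitionChangeBuilder code:

import java.util.Set;

// Hypothetical sketch of the reassignment completion condition above;
// not the actual PartitionChangeBuilder implementation.
final class ReassignmentCompletion {
    // A reassignment may only complete (or be canceled) if every adding
    // replica has joined the ISR and the resulting ISR size still meets
    // min.insync.replicas.
    static boolean canComplete(Set<Integer> targetIsr,
                               Set<Integer> addingReplicas,
                               int minInsyncReplicas) {
        return targetIsr.containsAll(addingReplicas)
                && targetIsr.size() >= minInsyncReplicas;
    }
}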
On Wed, Sep 6, 2023 at 6:32 AM Jack Vanlightly <vanligh...@apache.org> wrote:
> Hi Calvin,
>
> Regarding partition reassignment, I have two comments.
>
> I notice the KIP says "The AlterPartitionReassignments will not change the ELR"; however, when a reassignment completes (or reverts), any replicas removed from the replica set would be removed from the ELR. Sounds obvious, but I figured we should be explicit about that.
>
> Reassignment should also respect min.insync.replicas, because currently a reassignment can complete as long as the ISR is not empty and all added replicas are members. However, my TLA+ specification, which now includes reassignment, finds single-broker failures that can cause committed data loss, despite the added protection of the ELR and min.insync.replicas=2. These scenarios are limited to shrinking the size of the replica set. If we modify the PartitionChangeBuilder to add the completion condition that the target ISR size >= min.insync.replicas, then that closes this last single-broker-failure data loss case.
>
> With the above modification, the TLA+ specification of the ELR part of the design is standing up to all safety and liveness checks. The only thing that is not modeled is the unclean recovery, though I may leave that out as the specification is already very large.
>
> Jack
>
> On 2023/09/01 22:27:10 Calvin Liu wrote:
> > Hi Justine
> >
> > 1. With the new proposal, in order to let the consumer consume a message when only 1 replica commits it to its log, the min ISR has to be set to 1.
> >
> > 2. Yes, we will need an unclean recovery if the leader has an unclean shutdown.
> >
> > 3. If the min ISR config is changed to a larger value, the ISR and ELR will not be updated. ISR members are always valid no matter how min ISR changes. If the ELR is not empty, then the HWM can't advance either after the min ISR increase, so the ELR members are safe to stay.
> >
> > 4. I will highlight the explanation. Thanks.
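To make the rule behind point 3 above concrete, here is a minimal sketch of the proposed HWM advancement condition. The names are hypothetical, not the actual Kafka leader code:

import java.util.Collection;

// Hypothetical sketch of the proposed rule: the HWM may only advance when
// a) every controller-committed ISR member has replicated up to the
// proposed HWM, and b) the ISR has at least min.insync.replicas members.
final class HighWatermark {
    static boolean canAdvance(Collection<Long> isrLogEndOffsets,
                              long proposedHwm,
                              int minInsyncReplicas) {
        boolean fullyReplicated = isrLogEndOffsets.stream()
                .allMatch(leo -> leo >= proposedHwm);
        return fullyReplicated
                && isrLogEndOffsets.size() >= minInsyncReplicas;
    }
}

Under this rule, a min ISR increase that leaves the ISR below the new value blocks further HWM advancement, which is why the ELR members are safe to stay.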
> > On Thu, Aug 31, 2023 at 4:35 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
> > > Hey Calvin,
> > >
> > > Thanks for the responses. I think I understood most of it, but had a few follow-up questions.
> > >
> > > 1. For the acks=1 case, I was wondering if there is any way to continue with the current behavior (i.e., we only need one ack to produce to the log and consider the request complete). My understanding is that we can also consume from such topics at that point. If users wanted this lower durability, could they set min.insync.replicas to 1?
> > >
> > > 2. For the case where we elect a leader that was unknowingly offline: say this replica was the only one in the ELR. My understanding is that we would promote it to ISR and remove it from ELR when it is the leader, but then we would remove it from ISR and have no brokers in ISR or ELR. From there we would need to do unclean recovery, right?
> > >
> > > 3. Did we address the case where min ISR is dynamically increased?
> > >
> > > 4. I think my comment was more about confusion on the KIP. It was not clear to me that the section was describing points if one was done before the other. But I now see the sentence explaining that. I think I skipped from "delivery plan" to the bullet points.
> > >
> > > Justine
> > >
> > > On Thu, Aug 31, 2023 at 4:04 PM Calvin Liu <ca...@confluent.io.invalid> wrote:
> > > > Hi Justine
> > > > Thanks for the questions!
> > > >
> > > > a. *For my understanding, will we block replication? Or just the high watermark advancement?*
> > > > - The replication will not be blocked. The followers are free to replicate messages above the HWM. Only HWM advancement is blocked.
> > > >
> > > > b. *Also in the acks=1 case, if folks want to continue the previous behavior, they also need to set min.insync.replicas to 1, correct?*
> > > > - If the clients only send ack=1 messages and minISR=2, the HWM behavior will only be different when there is 1 replica in the ISR. In that case, the min ISR does not do much in the current system. It is kind of a trade-off, but we think it is ok.
> > > >
> > > > c. *The KIP seems to suggest that we remove from ELR when we start up again and notice we do not have the clean shutdown file. Is there a chance we have an offline broker in ELR that had an unclean shutdown that we elect as a leader before we get the chance to realize the shutdown was unclean?*
> > > > - The controller will only elect an unfenced (online) replica as the leader. If a broker has an unclean shutdown, it should register with the controller first (where it has to declare whether it had a clean or unclean shutdown) and then start to serve broker requests. So:
> > > > 1. If the broker has an unclean shutdown before the controller is aware that the replica is offline, then the broker can become the leader temporarily. But it can't serve any Fetch requests before it registers again, and that's when the controller will re-elect a leader.
> > > > 2. If the controller knows the replica is offline (missing heartbeats from the broker for a while) before the broker re-registers, the broker can't be elected as a leader.
> > > >
> > > > d. *Would this be the case for strictly a smaller min ISR?*
> > > > - Yes, only when we have a smaller min ISR. Once the leader is aware of the minISR change, the HWM can advance and make the current ELR obsolete. So the controller should clear the ELR if the ISR >= the new min ISR.
> > > >
> > > > e. *I thought we said the above "Last Leader" behavior can't be maintained with an empty ISR and it should be removed.*
> > > > - As the KIP is a big one, we have to consider delivering it in phases. If only the Unclean Recovery is delivered, we do not touch the ISR, so the ISR behavior will be the same as it is currently. I am open to the proposal of directly starting unclean recovery if the last leader fails. Let's see if other folks hope to have more if Unclean Recovery delivers first.
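A small sketch of the controller-side handling described in (d), with hypothetical names rather than the actual controller code:

import java.util.Set;

// Hypothetical sketch of handling a min.insync.replicas decrease, per (d):
// once the ISR alone satisfies the new, smaller min ISR, the HWM can
// advance again, so the current ELR members are obsolete and are cleared.
final class MinIsrUpdate {
    static Set<Integer> elrAfterMinIsrChange(Set<Integer> isr,
                                             Set<Integer> elr,
                                             int newMinIsr) {
        return isr.size() >= newMinIsr ? Set.of() : elr;
    }
}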
> > > > On Tue, Aug 29, 2023 at 4:53 PM Justine Olshan <jols...@confluent.io.invalid> wrote:
> > > > > Hey Calvin,
> > > > >
> > > > > Thanks for the KIP. This will close some of the gaps in leader election! I have a few questions:
> > > > >
> > > > > *> High Watermark can only advance if the ISR size is larger than or equal to min.insync.replicas.*
> > > > >
> > > > > For my understanding, will we block replication? Or just the high watermark advancement? Also in the acks=1 case, if folks want to continue the previous behavior, they also need to set min.insync.replicas to 1, correct? It seems like this change takes some control away from clients when it comes to durability vs. availability.
> > > > >
> > > > > *> ELR + ISR size will not be dropped below the min ISR unless the controller discovers an ELR member has an unclean shutdown.*
> > > > >
> > > > > The KIP seems to suggest that we remove from ELR when we start up again and notice we do not have the clean shutdown file. Is there a chance we have an offline broker in ELR that had an unclean shutdown that we elect as a leader before we get the chance to realize the shutdown was unclean? This seems like it could cause some problems. I may have missed how we avoid this scenario though.
> > > > >
> > > > > *> When updating the config min.insync.replicas, if the new min ISR <= current ISR, the ELR will be removed.*
> > > > >
> > > > > Would this be the case for strictly a smaller min ISR? I suppose if we increase the ISR, we can't reason about ELR. Can we reason about the high watermark in this case? It seems we would have a broker out of the ISR that is in neither the ISR nor the ELR. (Forgive me if we can't increase min ISR if the increase will put us under it.)
> > > > >
> > > > > *> Unclean recovery.*
> > > > > - *The unclean leader election will be replaced by the unclean recovery.*
> > > > > - *unclean.leader.election.enable will only be replaced by the unclean.recovery.strategy after ELR is delivered.*
> > > > > - *As there is no change to the ISR, the "last known leader" behavior is maintained.*
> > > > >
> > > > > What does "last known leader behavior maintained" mean here? I thought we said *"The above "Last Leader" behavior can't be maintained with an empty ISR and it should be removed."* My understanding is that once the metadata version is updated, we will always take the more thoughtful unclean election process (i.e., inspect the logs).
> > > > >
> > > > > Overall though, the general KIP is pretty solid. Looking at the rejected alternatives, it looks like a lot was considered, so it's nice to see the final proposal.
> > > > >
> > > > > Justine
> > > > >
> > > > > On Mon, Aug 14, 2023 at 8:50 AM Calvin Liu <ca...@confluent.io.invalid> wrote:
> > > > > > 1. Yes, the new protocol requires 2 things to advance the HWM: a) the messages have been replicated to the controller-committed ISR members, and b) the number of ISR members should be at least the min ISR.
> > > > > >
> > > > > > 2. With the current protocol, we are not able to select broker 1 as the leader. If we first assume we have the new HWM requirement in place, then broker 1 is a good candidate to choose. The following (ELR) part of the KIP will explain a new mechanism that enables us to choose broker 1. Note, if both HWM and ELR are in place, broker 1 will actually be elected in T3.
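To make the election order implied above concrete, here is a hypothetical sketch (not the actual controller code): prefer an unfenced ISR member, then an unfenced ELR member, and only fall back to unclean recovery when both are exhausted.

import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of the leader election preference implied by the
// thread; not the actual controller implementation.
final class LeaderElection {
    static Optional<Integer> electLeader(List<Integer> isr,
                                         List<Integer> elr,
                                         Predicate<Integer> isUnfenced) {
        // Prefer an online ISR member; otherwise try an online ELR member.
        // An empty result here would be the trigger for unclean recovery.
        return isr.stream().filter(isUnfenced).findFirst()
                .or(() -> elr.stream().filter(isUnfenced).findFirst());
    }
}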
> > > > > > On Fri, Aug 11, 2023 at 10:05 AM Jeff Kim <jeff....@confluent.io.invalid> wrote:
> > > > > > > Hi Calvin,
> > > > > > >
> > > > > > > Thanks for the KIP! I'm still digesting it, but I have two questions:
> > > > > > >
> > > > > > > > In the scenario raised in the motivation section, the server may receive ack=1 messages during T1 and advance High Watermark when the leader is the only one in ISR.
> > > > > > >
> > > > > > > To confirm, the current protocol allows advancing the HWM if all brokers in the ISR append to their logs (in this case only the leader). And we're proposing to advance the HWM only when <at least min.insync.replicas> brokers replicate. Is this correct?
> > > > > > >
> > > > > > > > Then, if we elect broker 1 as the leader at T4, though we can guarantee the safety of ack=all messages, the High Watermark may move backward, which causes further impacts on the consumers.
> > > > > > >
> > > > > > > How can broker 1 become the leader if it was ineligible in T3? Or are you referring to broker 2?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jeff
> > > > > > >
> > > > > > > On Thu, Aug 10, 2023 at 6:48 PM Calvin Liu <ca...@confluent.io.invalid> wrote:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I'd like to discuss a series of enhancements to the replication protocol.
> > > > > > > >
> > > > > > > > A partition replica can experience local data loss in unclean shutdown scenarios where unflushed data in the OS page cache is lost, such as during an availability zone power outage or a server error. The Kafka replication protocol is designed to handle these situations by removing such replicas from the ISR and only re-adding them once they have caught up and therefore recovered any lost data. This prevents replicas that lost an arbitrary log suffix, which included committed data, from being elected leader.
> > > > > > > >
> > > > > > > > However, there is a "last replica standing" state which, when combined with a data loss unclean shutdown event, can turn a local data loss scenario into a global data loss scenario, i.e., committed data can be removed from all replicas. When the last replica in the ISR experiences an unclean shutdown and loses committed data, it will be reelected leader after starting up again, causing rejoining followers to truncate their logs and thereby removing the last copies of the committed records which the leader lost initially.
> > > > > > > >
> > > > > > > > The new KIP maximizes the protection and provides MinISR-1 tolerance to data loss unclean shutdown events.
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas