Hi Calvin,

Regarding partition reassignment, I have two comments.

I notice the KIP says "The AlterPartitionReassignments will not change the
ELR"; however, when a reassignment completes (or reverts), any replicas
removed from the replica set would be removed from the ELR. Sounds obvious,
but I figured we should be explicit about that.

Reassignment should also respect min.insync.replicas, because currently a
reassignment can complete as long as the ISR is not empty and all added
replicas are members. However, my TLA+ specification, which now includes
reassignment, finds single-broker failures that can cause committed data
loss, despite the added protection of the ELR and min.insync.replicas=2.
These scenarios are limited to shrinking the size of the replica set. If we
modify the PartitionChangeBuilder to add the completion condition that the
target ISR >= min.insync.replicas, then that closes this last
single-broker-failure data loss case (see the sketch below).

With the above modification, the TLA+ specification of the ELR part of the
design is standing up to all safety and liveness checks. The only thing that
is not modeled is the unclean recovery, though I may leave that out, as the
specification is already very large.

Jack
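A minimal sketch of that completion condition, using hypothetical names
rather than the real PartitionChangeBuilder internals:

    import java.util.List;
    import java.util.Set;

    public class ReassignmentCompletionCheck {

        // Today a reassignment can complete once the ISR is non-empty and
        // all adding replicas have joined it. The proposed extra condition
        // also requires the target ISR to be at least min.insync.replicas,
        // which closes the single-broker-failure data loss case when the
        // reassignment shrinks the replica set.
        static boolean canComplete(Set<Integer> targetIsr,
                                   List<Integer> addingReplicas,
                                   int minInsyncReplicas) {
            return !targetIsr.isEmpty()
                && targetIsr.containsAll(addingReplicas)
                && targetIsr.size() >= minInsyncReplicas; // proposed new condition
        }

        public static void main(String[] args) {
            // Shrinking to a single in-sync replica with minISR=2: the
            // current rule would allow completion, the proposed rule blocks it.
            System.out.println(canComplete(Set.of(1), List.of(), 2));     // false
            System.out.println(canComplete(Set.of(1, 2), List.of(2), 2)); // true
        }
    }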
On 2023/09/01 22:27:10 Calvin Liu wrote:
> Hi Justine
> 1. With the new proposal, in order to let the consumer consume a message
> when only 1 replica commits it to its log, the min ISR has to be set to 1.
> 2. Yes, we will need an unclean recovery if the leader has an unclean
> shutdown.
> 3. If the min ISR config is changed to a larger value, the ISR and ELR will
> not be updated. ISR members are always valid no matter how min ISR changes.
> If the ELR is not empty, then the HWM can't advance either after the min
> ISR increase, so the ELR members are safe to stay.
> 4. I will highlight the explanation. Thanks.
>
> On Thu, Aug 31, 2023 at 4:35 PM Justine Olshan
> <jols...@confluent.io.invalid> wrote:
> > Hey Calvin,
> >
> > Thanks for the responses. I think I understood most of it, but had a few
> > follow-up questions.
> >
> > 1. For the acks=1 case, I was wondering if there is any way to continue
> > with the current behavior (i.e., we only need one ack to produce to the
> > log and consider the request complete). My understanding is that we can
> > also consume from such topics at that point. If users wanted this lower
> > durability, could they set min.insync.replicas to 1?
> >
> > 2. For the case where we elect a leader that was unknowingly offline:
> > say this replica was the only one in the ELR. My understanding is that
> > we would promote it to the ISR and remove it from the ELR when it is the
> > leader, but then we would remove it from the ISR and have no brokers in
> > the ISR or ELR. From there we would need to do unclean recovery, right?
> >
> > 3. Did we address the case where the min ISR is dynamically increased?
> >
> > 4. I think my comment was more about confusion on the KIP. It was not
> > clear to me that the section was describing points if one was done
> > before the other. But I now see the sentence explaining that. I think I
> > skipped from "delivery plan" to the bullet points.
> >
> > Justine
> >
> > On Thu, Aug 31, 2023 at 4:04 PM Calvin Liu <ca...@confluent.io.invalid>
> > wrote:
> > > Hi Justine,
> > > Thanks for the questions!
> > >
> > > a. *For my understanding, will we block replication? Or just the high
> > > watermark advancement?*
> > > - The replication will not be blocked. The followers are free to
> > > replicate messages above the HWM. Only HWM advancement is blocked.
> > >
> > > b. *Also in the acks=1 case, if folks want to continue the previous
> > > behavior, they also need to set min.insync.replicas to 1, correct?*
> > > - If the clients only send acks=1 messages and minISR=2, the HWM
> > > behavior will only be different when there is 1 replica in the ISR. In
> > > that case, the min ISR does not do much in the current system. It is
> > > kind of a trade-off, but we think it is ok.
> > >
> > > c. *The KIP seems to suggest that we remove from the ELR when we start
> > > up again and notice we do not have the clean shutdown file. Is there a
> > > chance we have an offline broker in the ELR that had an unclean
> > > shutdown that we elect as a leader before we get the chance to realize
> > > the shutdown was unclean?*
> > > - The controller will only elect an unfenced (online) replica as the
> > > leader. If a broker has an unclean shutdown, it has to register with
> > > the controller first (where it has to declare whether it had a clean
> > > or unclean shutdown) and then start to serve broker requests. So:
> > > 1. If the broker has an unclean shutdown before the controller is
> > > aware that the replica is offline, then the broker can become the
> > > leader temporarily. But it can't serve any Fetch requests before it
> > > registers again, and that's when the controller will re-elect a leader.
> > > 2. If the controller knows the replica is offline (missing heartbeats
> > > from the broker for a while) before the broker re-registers, the
> > > broker can't be elected as a leader.
> > >
> > > d. *Would this be the case for strictly a smaller min ISR?*
> > > - Yes, only when we have a smaller min ISR. Once the leader is aware
> > > of the minISR change, the HWM can advance and make the current ELR
> > > obsolete, so the controller should clear the ELR if the ISR >= the new
> > > min ISR.
> > >
> > > e. *I thought we said the above "Last Leader" behavior can't be
> > > maintained with an empty ISR and it should be removed.*
> > > - As the KIP is a big one, we have to consider delivering it in
> > > phases. If only the Unclean Recovery is delivered, we do not touch the
> > > ISR, so the ISR behavior will be the same as today. I am open to the
> > > proposal of directly starting unclean recovery if the last leader
> > > fails. Let's see if other folks hope to have more if Unclean Recovery
> > > is delivered first.
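As a rough illustration of points 3 and (d) above, the controller-side rule
amounts to the following sketch (the names are made up, not the controller's
actual API):

    import java.util.Set;

    public class ElrOnMinIsrChange {

        // If the current ISR already satisfies the new (smaller) min ISR,
        // the HWM can advance again, making the current ELR obsolete, so
        // it is cleared. If min ISR was raised instead, ISR members stay
        // valid and a non-empty ELR keeps the HWM from advancing, so the
        // ELR members are safe to stay.
        static Set<Integer> elrAfterMinIsrChange(Set<Integer> isr,
                                                 Set<Integer> elr,
                                                 int newMinIsr) {
            if (isr.size() >= newMinIsr) {
                return Set.of(); // ELR obsolete once the HWM can advance again
            }
            return elr; // ELR kept unchanged
        }

        public static void main(String[] args) {
            // min ISR lowered to 1 with ISR={1}: ELR {2} is cleared.
            System.out.println(elrAfterMinIsrChange(Set.of(1), Set.of(2), 1)); // []
            // min ISR raised to 3: ELR is kept.
            System.out.println(elrAfterMinIsrChange(Set.of(1), Set.of(2), 3)); // [2]
        }
    }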
> > > On Tue, Aug 29, 2023 at 4:53 PM Justine Olshan
> > > <jols...@confluent.io.invalid> wrote:
> > > > Hey Calvin,
> > > >
> > > > Thanks for the KIP. This will close some of the gaps in leader
> > > > election! I had a few questions:
> > > >
> > > > *> High Watermark can only advance if the ISR size is larger or
> > > > equal to min.insync.replicas.*
> > > > For my understanding, will we block replication? Or just the high
> > > > watermark advancement?
> > > > Also in the acks=1 case, if folks want to continue the previous
> > > > behavior, they also need to set min.insync.replicas to 1, correct?
> > > > It seems like this change takes some control away from clients when
> > > > it comes to durability vs availability.
> > > >
> > > > *> ELR + ISR size will not be dropped below the min ISR unless the
> > > > controller discovers an ELR member has an unclean shutdown.*
> > > > The KIP seems to suggest that we remove from the ELR when we start
> > > > up again and notice we do not have the clean shutdown file. Is there
> > > > a chance we have an offline broker in the ELR that had an unclean
> > > > shutdown that we elect as a leader before we get the chance to
> > > > realize the shutdown was unclean? This seems like it could cause
> > > > some problems. I may have missed how we avoid this scenario though.
> > > >
> > > > *> When updating the config min.insync.replicas, if the new min ISR
> > > > <= current ISR, the ELR will be removed.*
> > > > Would this be the case for strictly a smaller min ISR? I suppose if
> > > > we increase the min ISR, we can't reason about the ELR. Can we
> > > > reason about the high watermark in this case? It seems like we will
> > > > have the broker out of the ISR in neither the ISR nor the ELR.
> > > > (Forgive me if we can't increase the min ISR if the increase will
> > > > put us under it.)
> > > >
> > > > *> Unclean recovery.*
> > > > - *The unclean leader election will be replaced by the unclean
> > > > recovery.*
> > > > - *unclean.leader.election.enable will only be replaced by the
> > > > unclean.recovery.strategy after ELR is delivered.*
> > > > - *As there is no change to the ISR, the "last known leader"
> > > > behavior is maintained.*
> > > > What does "last known leader behavior is maintained" mean here? I
> > > > thought we said *"The above "Last Leader" behavior can't be
> > > > maintained with an empty ISR and it should be removed."* My
> > > > understanding is that once the metadata version is updated, we will
> > > > always take the more thoughtful unclean election process (i.e.,
> > > > inspect the logs).
> > > >
> > > > Overall though, the general KIP is pretty solid. Looking at the
> > > > rejected alternatives, it looks like a lot was considered, so it's
> > > > nice to see the final proposal.
> > > >
> > > > Justine
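The offline-broker concern above is what answer (c) addresses: only unfenced
replicas are electable, and an uncleanly restarted broker must re-register
(dropping out of the ELR) before it can serve fetches. A sketch of that
eligibility rule, assuming hypothetical names and that ISR members are
preferred over ELR members (not the actual QuorumController code):

    import java.util.List;
    import java.util.Optional;
    import java.util.Set;
    import java.util.function.Predicate;

    public class LeaderEligibility {

        // The controller only elects unfenced (online) replicas, picking
        // from the ISR first and falling back to the ELR. A broker that had
        // an unclean shutdown must re-register before serving fetches, and
        // on that registration the controller removes it from the ELR and
        // re-elects, so it cannot quietly lead with a truncated log.
        static Optional<Integer> electLeader(List<Integer> isr,
                                             List<Integer> elr,
                                             Set<Integer> unfencedBrokers) {
            Predicate<Integer> online = unfencedBrokers::contains;
            return isr.stream().filter(online).findFirst()
                      .or(() -> elr.stream().filter(online).findFirst());
        }

        public static void main(String[] args) {
            // Empty ISR and the only ELR member (2) is fenced/offline:
            // no eligible leader, so unclean recovery would be needed.
            System.out.println(electLeader(List.of(), List.of(2), Set.of(1, 3))); // Optional.empty
            System.out.println(electLeader(List.of(), List.of(2), Set.of(2)));    // Optional[2]
        }
    }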
> > > > On Mon, Aug 14, 2023 at 8:50 AM Calvin Liu
> > > > <ca...@confluent.io.invalid> wrote:
> > > > > 1. Yes, the new protocol requires 2 things to advance the HWM:
> > > > > a) the messages have been replicated to the controller-committed
> > > > > ISR members, and b) the number of ISR members should be at least
> > > > > the min ISR.
> > > > > 2. With the current protocol, we are not able to select broker 1
> > > > > as the leader. If we first assume we have the new HWM requirement
> > > > > in place, then broker 1 is a good candidate to choose. The
> > > > > following part of the KIP (the ELR part) will explain a new
> > > > > mechanism that enables us to choose broker 1. Note, if both the
> > > > > HWM and ELR changes are in place, broker 1 will actually be
> > > > > elected in T3.
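Point 1 above, sketched in code with illustrative names only (not Kafka's
actual replica-manager logic):

    import java.util.Map;

    public class HwmAdvancement {

        // The HWM may advance to candidateOffset only if (a) every
        // controller-committed ISR member has replicated up to it, and
        // (b) the ISR size is at least min.insync.replicas. Followers can
        // still fetch records above the HWM; only its advancement (and
        // thus acks=all completion and consumer visibility) is blocked.
        static boolean canAdvance(Map<Integer, Long> isrLogEndOffsets,
                                  long candidateOffset,
                                  int minInsyncReplicas) {
            return isrLogEndOffsets.size() >= minInsyncReplicas    // condition (b), new
                && isrLogEndOffsets.values().stream()
                       .allMatch(leo -> leo >= candidateOffset);   // condition (a), existing
        }

        public static void main(String[] args) {
            // A lone leader in the ISR with minISR=2 keeps replicating but
            // cannot advance the HWM.
            System.out.println(canAdvance(Map.of(1, 100L), 100L, 2));          // false
            System.out.println(canAdvance(Map.of(1, 100L, 2, 100L), 100L, 2)); // true
        }
    }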
> > > > > On Fri, Aug 11, 2023 at 10:05 AM Jeff Kim
> > > > > <jeff....@confluent.io.invalid> wrote:
> > > > > > Hi Calvin,
> > > > > >
> > > > > > Thanks for the KIP! I'm still digesting it, but I have two
> > > > > > questions:
> > > > > >
> > > > > > > In the scenario raised in the motivation section, the server
> > > > > > > may receive ack=1 messages during T1 and advance the High
> > > > > > > Watermark when the leader is the only one in the ISR.
> > > > > > To confirm, the current protocol allows advancing the HWM if all
> > > > > > brokers in the ISR append to their logs (in this case only the
> > > > > > leader). And we're proposing to advance the HWM only when at
> > > > > > least min.insync.replicas brokers replicate. Is this correct?
> > > > > >
> > > > > > > Then, if we elect broker 1 as the leader at T4, though we can
> > > > > > > guarantee the safety of ack=all messages, the High Watermark
> > > > > > > may move backward, which causes further impacts on the
> > > > > > > consumers.
> > > > > > How can broker 1 become the leader if it was ineligible in T3?
> > > > > > Or are you referring to broker 2?
> > > > > >
> > > > > > Thanks,
> > > > > > Jeff
> > > > > >
> > > > > > On Thu, Aug 10, 2023 at 6:48 PM Calvin Liu
> > > > > > <ca...@confluent.io.invalid> wrote:
> > > > > > > Hi everyone,
> > > > > > > I'd like to discuss a series of enhancements to the
> > > > > > > replication protocol.
> > > > > > >
> > > > > > > A partition replica can experience local data loss in unclean
> > > > > > > shutdown scenarios where unflushed data in the OS page cache
> > > > > > > is lost, such as during an availability zone power outage or a
> > > > > > > server error. The Kafka replication protocol is designed to
> > > > > > > handle these situations by removing such replicas from the ISR
> > > > > > > and only re-adding them once they have caught up and therefore
> > > > > > > recovered any lost data. This prevents replicas that lost an
> > > > > > > arbitrary log suffix, which included committed data, from
> > > > > > > being elected leader.
> > > > > > > However, there is a "last replica standing" state which, when
> > > > > > > combined with a data-loss unclean shutdown event, can turn a
> > > > > > > local data loss scenario into a global data loss scenario,
> > > > > > > i.e., committed data can be removed from all replicas. When
> > > > > > > the last replica in the ISR experiences an unclean shutdown
> > > > > > > and loses committed data, it will be re-elected leader after
> > > > > > > starting up again, causing rejoining followers to truncate
> > > > > > > their logs and thereby remove the last copies of the committed
> > > > > > > records which the leader lost initially.
> > > > > > >
> > > > > > > The new KIP will maximize the protection and provide minISR-1
> > > > > > > tolerance to data loss unclean shutdown events.
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas