Hey Stanislav,

Sure. Thanks for your interest in this KIP. I am glad to provide more detail.
> broker A is initiating a controlled shutdown (restart). The Controller
> sends a StopReplicaRequest but it reaches broker A after it has started up
> again. He therefore stops replicating those partitions even though he
> should just be starting to

This is right.

> Controller sends a LeaderAndIsrRequest before broker A initiates a
> restart. Broker A restarts and receives the LeaderAndIsrRequest then. It
> therefore starts leading for the partitions sent by that request and might
> stop leading partitions that it was leading previously. This was well
> explained in the linked JIRA, but I cannot understand why that would
> happen due to my limited experience. If Broker A leads p1 and p2, when
> would a Controller send a LeaderAndIsrRequest with p1 only and not want
> Broker A to drop leadership for p2?

The root cause of the issue is that after a broker restarts, it relies on the first LeaderAndIsrRequest it receives to populate its in-memory partition state, and only then initializes the high watermark checkpoint thread. The high watermark checkpoint thread overwrites the high watermark checkpoint file based on the broker's in-memory partition states. In other words, if a partition that is physically hosted by the broker is missing from the in-memory partition state map, its high watermark will be lost after the checkpoint thread overwrites the file. (Related code: https://github.com/apache/kafka/blob/ed3bd79633ae227ad995dafc3d9f384a5534d4e9/core/src/main/scala/kafka/server/ReplicaManager.scala#L1091)

In your example, assume the first LeaderAndIsrRequest broker A receives is the one initiated by the controlled shutdown logic in the Controller to move leadership away from broker A. That LeaderAndIsrRequest contains only the partitions broker A leads, not all the partitions it hosts (i.e. no follower partitions), so the high watermarks for the follower partitions will be lost. (See the first sketch below.) Also, the first LeaderAndIsrRequest broker A receives may not necessarily be the one initiated by the controlled shutdown logic (e.g. there can be an ongoing preferred leader election), although I think this may not be very common.

> Here the controller will start processing the BrokerChange event (that
> says that broker A shut down) after the broker has come back up and
> re-registered himself in ZK? How will the Controller miss the restart,
> won't he subsequently receive another ZK event saying that broker A has
> come back up?

The Controller will not miss the BrokerChange event; in fact, two BrokerChange events will fire in this case (one for the broker's deregistration in ZK and one for its re-registration). However, when processing a BrokerChange event, the Controller needs to read from ZooKeeper to get the current set of brokers in the cluster, and if the bounced broker has already rejoined the cluster by that time, the Controller will not know the broker was bounced because it sees no diff between ZK and its in-memory cache. So both BrokerChange events effectively become no-ops. (See the second sketch below.)
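To make the high watermark point concrete, here is a first, minimal Scala sketch. The names and numbers are made up for illustration (this is not the actual ReplicaManager code): the checkpoint pass rewrites the file purely from the in-memory partition map, so a hosted partition absent from that map loses its recorded high watermark.

    import scala.collection.mutable

    object HighWatermarkCheckpointSketch {
      final case class TopicPartition(topic: String, partition: Int)

      def main(args: Array[String]): Unit = {
        // On disk before the bounce: broker A hosts t-0 (as leader) and
        // t-1 (as follower), both with checkpointed high watermarks.
        val checkpointFile = mutable.Map(
          TopicPartition("t", 0) -> 42L,
          TopicPartition("t", 1) -> 17L
        )

        // In-memory state populated from the first LeaderAndIsrRequest after
        // startup. If that request came from the controlled-shutdown logic,
        // it names only the partitions broker A was leading, so t-1 is absent.
        val inMemoryPartitions = Map(TopicPartition("t", 0) -> 42L)

        // The checkpoint pass overwrites the file from the in-memory view:
        checkpointFile.clear()
        checkpointFile ++= inMemoryPartitions

        println(checkpointFile) // t-1's high watermark (17) is gone
      }
    }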
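And a second, similarly simplified sketch of the BrokerChange handling (again illustrative, not the real KafkaController code): the handler diffs the broker ids read back from ZK against the controller's in-memory cache, so a broker that deregistered and re-registered before the events are processed produces an empty diff both times.

    object BrokerChangeSketch {
      // Called once per queued BrokerChange event.
      def processBrokerChange(brokersInZk: Set[Int], cachedBrokers: Set[Int]): Unit = {
        val newBrokers  = brokersInZk -- cachedBrokers
        val deadBrokers = cachedBrokers -- brokersInZk
        if (newBrokers.isEmpty && deadBrokers.isEmpty)
          println("no diff -> no-op; no initialization requests sent")
        // else: onBrokerStartup(newBrokers) / onBrokerFailure(deadBrokers)
      }

      def main(args: Array[String]): Unit = {
        // Broker 1 bounced, firing two BrokerChange events, but it is already
        // back in ZK by the time either event is processed:
        processBrokerChange(brokersInZk = Set(1, 2, 3), cachedBrokers = Set(1, 2, 3))
        processBrokerChange(brokersInZk = Set(1, 2, 3), cachedBrokers = Set(1, 2, 3))
      }
    }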
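For completeness, here is the rough shape of the fix the KIP proposes, as I understand it (the field and variable names below are made up for illustration; please see the KIP for the actual protocol changes): stamp control requests with the target broker's generation, e.g. derived from its current ZK registration, so a broker can cheaply reject requests addressed to an earlier incarnation of itself.

    object BrokerGenerationSketch {
      final case class ControlRequest(brokerEpoch: Long)

      // Updated each time the broker (re-)registers in ZooKeeper.
      var currentBrokerEpoch: Long = 200L

      def shouldProcess(req: ControlRequest): Boolean =
        req.brokerEpoch >= currentBrokerEpoch

      def main(args: Array[String]): Unit = {
        // A StopReplicaRequest created before the bounce carries the old
        // generation and is rejected instead of stopping the new replicas:
        println(shouldProcess(ControlRequest(brokerEpoch = 100L))) // false
        println(shouldProcess(ControlRequest(brokerEpoch = 200L))) // true
      }
    }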
Hope that answers your questions. Feel free to follow up if I am missing something.

Thanks,
Zhanxiang (Patrick) Huang

________________________________
From: Stanislav Kozlovski <stanis...@confluent.io>
Sent: Wednesday, October 10, 2018 7:22
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-380: Detect outdated control requests and bounced brokers using broker generation

Hi Patrick,

Thanks for the KIP! Fixing such correctness issues is always very welcome - they're commonly hard to diagnose and debug when they happen in production.

I was wondering if I understood the potential correctness issues correctly. Here is what I got:

- If a broker bounces during controlled shutdown, the bounced broker may accidentally process its earlier generation's StopReplicaRequest sent from the active controller for one of its follower replicas, leaving the replica offline while its remaining replicas may stay online

broker A is initiating a controlled shutdown (restart). The Controller sends a StopReplicaRequest but it reaches broker A after it has started up again. He therefore stops replicating those partitions even though he should just be starting to

- If the first LeaderAndIsrRequest that a broker processes is sent by the active controller before its startup, the broker will overwrite the high watermark checkpoint file and may cause incorrect truncation (KAFKA-7235 <https://issues.apache.org/jira/browse/KAFKA-7235>)

Controller sends a LeaderAndIsrRequest before broker A initiates a restart. Broker A restarts and receives the LeaderAndIsrRequest then. It therefore starts leading for the partitions sent by that request and might stop leading partitions that it was leading previously. This was well explained in the linked JIRA, but I cannot understand why that would happen due to my limited experience. If Broker A leads p1 and p2, when would a Controller send a LeaderAndIsrRequest with p1 only and not want Broker A to drop leadership for p2?

- If a broker bounces very quickly, the controller may start processing the BrokerChange event after the broker has already re-registered itself in zk. In this case, the controller will miss the broker restart and will not send any requests to the broker for initialization. The broker will not be able to accept traffic.

Here the controller will start processing the BrokerChange event (that says that broker A shut down) after the broker has come back up and re-registered himself in ZK? How will the Controller miss the restart, won't he subsequently receive another ZK event saying that broker A has come back up?

Could we explain these potential problems in a bit more detail just so they could be more easily digestible by novices?

Thanks,
Stanislav

On Wed, Oct 10, 2018 at 9:21 AM Dong Lin <lindon...@gmail.com> wrote:

> Hey Patrick,
>
> Thanks much for the KIP. The KIP is very well written.
>
> LGTM. +1 (binding)
>
> Thanks,
> Dong
>
>
> On Tue, Oct 9, 2018 at 11:46 PM Patrick Huang <hzx...@hotmail.com> wrote:
>
> > Hi All,
> >
> > Please find below the KIP, which proposes the concept of broker
> > generation to resolve issues caused by the controller missing broker
> > state changes and brokers processing outdated control requests.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-380%3A+Detect+outdated+control+requests+and+bounced+brokers+using+broker+generation
> >
> > All comments are appreciated.
> >
> > Best,
> > Zhanxiang (Patrick) Huang
>

--
Best,
Stanislav