Hi Patrick,

I am wondering why the controller should send a STALE_BROKER_EPOCH error to the broker if the broker epoch is stale. Would this be a little confusing to the current broker, given that the request was sent by a broker with a previous epoch? Should the controller just ignore those requests in that case?
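
To make the question concrete, here is a rough sketch of the two options as I understand them. All names below are made up for illustration; they are not from the KIP or the PR:

    object StaleEpochHandling {

      sealed trait Outcome
      case object RespondStaleBrokerEpoch extends Outcome // option 1: answer with STALE_BROKER_EPOCH
      case object IgnoreRequest extends Outcome           // option 2: silently drop the request
      case object ProceedWithShutdown extends Outcome     // epochs match: handle normally

      // requestEpoch: the czxid carried in the ControlledShutdownRequest
      // cachedEpoch:  the czxid the controller currently caches for that broker id
      def handleControlledShutdown(requestEpoch: Long,
                                   cachedEpoch: Long,
                                   respondOnStale: Boolean): Outcome =
        if (requestEpoch < cachedEpoch) {
          // The request comes from a previous incarnation of the broker. An explicit
          // error response may be seen by the restarted incarnation instead, which is
          // the possible confusion; dropping the request avoids that.
          if (respondOnStale) RespondStaleBrokerEpoch else IgnoreRequest
        } else {
          ProceedWithShutdown
        }
    }

Option 2 corresponds to "just ignore those requests" above.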
Thanks,
Jiangjie (Becket) Qin

On Fri, Nov 9, 2018 at 2:17 AM Patrick Huang <hzx...@hotmail.com> wrote:
> Hi,
>
> In this KIP, we are also going to add a new exception and a new error code "STALE_BROKER_EPOCH" in order to allow the broker to respond back the right error when it sees an outdated broker epoch in the control requests. Since adding a new exception and error code is also considered a public interface change, I have updated the original KIP accordingly to include this change. Feel free to comment if there is any concern.
>
> Thanks,
> Zhanxiang (Patrick) Huang
>
> ________________________________
> From: Patrick Huang <hzx...@hotmail.com>
> Sent: Tuesday, October 23, 2018 6:20
> To: Jun Rao; dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-380: Detect outdated control requests and bounced brokers using broker generation
>
> Agreed. I have updated the PR to add czxid in ControlledShutdownRequest (https://github.com/apache/kafka/pull/5821). Appreciated if you can take a look.
>
> Btw, I also have the vote thread for this KIP:
> https://lists.apache.org/thread.html/3689d83db537d5aa86d10967dee7ee29578897fc123daae4f77a8605@%3Cdev.kafka.apache.org%3E
>
> Best,
> Zhanxiang (Patrick) Huang
>
> ________________________________
> From: Jun Rao <j...@confluent.io>
> Sent: Monday, October 22, 2018 21:31
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-380: Detect outdated control requests and bounced brokers using broker generation
>
> Hi, Patrick,
>
> Yes, that's the general sequence. After step 2, the shutting down broker can give up the controlled shutdown process and proceed to shut down. When it's restarted, it could still receive StopReplica requests from the controller in reaction to the previous controlled shutdown requests. This could lead the restarted broker to a bad state.
>
> Thanks,
>
> Jun
>
> On Mon, Oct 22, 2018 at 4:32 PM, Patrick Huang <hzx...@hotmail.com> wrote:
> > Hi Jun,
> >
> > That is a good point. I want to make it clear about the scenario you mentioned. Correct me if I am wrong. Here is the sequence of events:
> >
> > 1. Broker sends ControlledShutdown request 1 to controller
> > 2. Broker sends ControlledShutdown request 2 to controller due to retries
> > 3. Controller processes ControlledShutdown request 1
> > 4. Controller sends control requests to the broker
> > 5. Broker receives ControlledShutdown response 1 from controller
> > 6. Broker shuts down and restarts quickly
> > 7. Controller processes ControlledShutdown request 2
> > 8. Controller sends control requests to the broker
> >
> > It is possible that the controller processes the broker change event between 6) and 7). In this case, the controller has already updated the cached czxid to the up-to-date one, so the bounced broker will not reject the control requests in 8), which causes a correctness problem.
> >
> > Best,
> > Zhanxiang (Patrick) Huang
> >
> > ------------------------------
> > *From:* Jun Rao <j...@confluent.io>
> > *Sent:* Monday, October 22, 2018 14:45
> > *To:* dev
> > *Subject:* Re: [DISCUSS] KIP-380: Detect outdated control requests and bounced brokers using broker generation
> >
> > Hi, Patrick,
> >
> > There is another thing that may be worth considering.
> >
> > 10. It will be useful to include the czxid also in the ControlledShutdown request. This way, if the broker has been restarted, the controller can ignore an old ControlledShutdown request (e.g., due to retries). This will prevent the restarted broker from incorrectly stopping replicas.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Oct 11, 2018 at 5:56 PM, Patrick Huang <hzxa21.hu...@gmail.com> wrote:
> > > Hi Jun,
> > >
> > > Thanks a lot for the comments.
> > >
> > > 1. czxid is globally unique and monotonically increasing based on the zookeeper doc.
> > > References (from https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html):
> > > "Every change to the ZooKeeper state receives a stamp in the form of a *zxid* (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2."
> > > "czxid: The zxid of the change that caused this znode to be created."
> > >
> > > 2. You are right. There will be only one broker change event fired in the case I mentioned because we will not register the watcher before the read.
> > >
> > > 3. Let's say we want to initialize the states of broker set A and we want the cluster to be aware of the absence of broker set B. The currently live broker set in the cluster is C.
> > >
> > > From the design point of view, here are the rules for broker state transition:
> > >    - Pass in broker ids of A for onBrokerStartup() and pass in broker ids of B for onBrokerFailure().
> > >    - When processing onBrokerStartup(), we use the broker generation the controller read from zk to send requests to broker set A, and use the previously cached broker generation to send requests to (C-A).
> > >    - When processing onBrokerFailure(), we use the previously cached broker generation to send requests to C.
> > >
> > > From the implementation point of view, here are the steps we need to follow when processing BrokerChangeEvent:
> > >    - Read all child nodes in /brokers/ids/ to get current brokers with broker generation
> > >    - Detect new brokers, dead brokers and bounced brokers
> > >    - Update the live broker ids in controller context
> > >    - Update broker generations for new brokers in controller context
> > >    - Invoke onBrokerStartup(new brokers)
> > >    - Invoke onBrokerFailure(bounced brokers)
> > >    - Update broker generations for bounced brokers in controller context
> > >    - Invoke onBrokerStartup(bounced brokers)
> > >    - Invoke onBrokerFailure(dead brokers)
> > > We can further optimize the flow by avoiding sending requests to a broker if its broker generation is larger than the one in the controller context.
> > >
> > > I will also update the KIP to clarify how it works for BrokerChangeEvent processing in more detail.
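
For illustration, the detection and invocation order in the steps quoted above can be sketched in plain Scala as follows (placeholder names; this is not the actual controller code):

    object BrokerChangeSketch {

      type BrokerId = Int
      type Epoch    = Long // czxid of the broker's /brokers/ids/<id> znode

      // cached:        broker generations currently held in the controller context
      // currentFromZk: broker generations just read back from ZooKeeper
      def process(cached: Map[BrokerId, Epoch],
                  currentFromZk: Map[BrokerId, Epoch],
                  onBrokerStartup: Set[BrokerId] => Unit,
                  onBrokerFailure: Set[BrokerId] => Unit): Map[BrokerId, Epoch] = {

        val newBrokers  = currentFromZk.keySet -- cached.keySet
        val deadBrokers = cached.keySet -- currentFromZk.keySet
        // Present on both sides but re-registered with a larger czxid: the broker
        // bounced between two controller reads of /brokers/ids.
        val bouncedBrokers =
          (cached.keySet & currentFromZk.keySet).filter(id => currentFromZk(id) > cached(id))

        // Invocation order from the steps above: a bounced broker is treated as a
        // failure of its old generation followed by a startup of its new generation.
        onBrokerStartup(newBrokers)
        onBrokerFailure(bouncedBrokers)
        onBrokerStartup(bouncedBrokers)
        onBrokerFailure(deadBrokers)

        currentFromZk // the controller context now caches the new generations
      }
    }

The epoch comparison is what makes a quick bounce visible even when the set of broker ids itself does not change.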
Since the controller is busy and hasn't > > > > registered the watcher again, the second event actually won't fire. > By > > > the > > > > time the controller reads from ZK, it sees that the broker is still > > > > registered and thus thinks that nothing has happened to the broker, > > which > > > > is causing the problem. > > > > > > > > 3. "Handle broker state change: invoke onBrokerFailure(...) first, > then > > > > invoke onBrokerStartUp(...)". We probably want to be a bit careful > > here. > > > > Could you clarify the broker list and the broker epoch used when > making > > > > these calls? We want to prevent the restarted broker from receiving a > > > > partial replica list on the first LeaderAndIsr request because of > this. > > > > > > > > Thanks, > > > > > > > > Jun > > > > > > > > On Wed, Oct 10, 2018 at 12:51 PM, Patrick Huang <hzx...@hotmail.com> > > > > wrote: > > > > > > > > > Hey Stanislav, > > > > > > > > > > Sure. Thanks for your interest in this KIP. I am glad to provide > more > > > > > detail. > > > > > > > > > > broker A is initiating a controlled shutdown (restart). The > > Controller > > > > > sends a StopReplicaRequest but it reaches broker A after it has > > started > > > > up > > > > > again. He therefore stops replicating those partitions even though > he > > > > > should just be starting to > > > > > This is right. > > > > > > > > > > Controller sends a LeaderAndIsrRequest before broker A initiates a > > > > restart. > > > > > Broker A restarts and receives the LeaderAndIsrRequest then. It > > > therefore > > > > > starts leading for the partitions sent by that request and might > stop > > > > > leading partitions that it was leading previously. > > > > > This was well explained in the linked JIRA, but I cannot understand > > why > > > > > that would happen due to my limited experience. If Broker A leads > p1 > > > and > > > > > p2, when would a Controller send a LeaderAndIsrRequest with p1 only > > and > > > > not > > > > > want Broker A to drop leadership for p2? > > > > > The root cause of the issue is that after a broker just restarts, > it > > > > > relies on the first LeaderAndIsrRequest to populate the partition > > state > > > > and > > > > > initializes the highwater mark checkpoint thread. The highwater > mark > > > > > checkpoint thread will overwrite the highwater mark checkpoint file > > > based > > > > > on the broker's in-memory partition states. In other words, If a > > > > partition > > > > > that is physically hosted by the broker is missing in the in-memory > > > > > partition states map, its highwater mark will be lost after the > > > highwater > > > > > mark checkpoint thread overwrites the file. (Related codes: > > > > > > https://github.com/apache/kafka/blob/ed3bd79633ae227ad995dafc3d9f38 > > > > > 4a5534d4e9/core/src/main/scala/kafka/server/ > > > ReplicaManager.scala#L1091) > > > > > [https://avatars3.githubusercontent.com/u/47359?s=400&v=4]< > > > > > > https://github.com/apache/kafka/blob/ed3bd79633ae227ad995dafc3d9f38 > > > > > 4a5534d4e9/core/src/main/scala/kafka/server/ > > > ReplicaManager.scala#L1091> > > > > > > > > > > apache/kafka<https://github.com/apache/kafka/blob/ > > > > > > > > > ed3bd79633ae227ad995dafc3d9f384a5534d4e9/core/src/main/ > > > scala/kafka/server/ > > > > > ReplicaManager.scala#L1091> > > > > > Mirror of Apache Kafka. Contribute to apache/kafka development by > > > > creating > > > > > an account on GitHub. 
> > > > >
> > > > > In your example, assume the first LeaderAndIsrRequest broker A receives is the one initiated in the controlled shutdown logic in the Controller to move leadership away from broker A. This LeaderAndIsrRequest only contains partitions that broker A leads, not all the partitions that broker A hosts (i.e. no follower partitions), so the highwater mark for the follower partitions will be lost. Also, the first LeaderAndIsrRequest broker A receives may not necessarily be the one initiated in the controlled shutdown logic (e.g. there can be an ongoing preferred leader election), although I think this may not be very common.
> > > > >
> > > > > "Here the controller will start processing the BrokerChange event (that says that broker A shutdown) after the broker has come back up and re-registered himself in ZK? How will the Controller miss the restart, won't he subsequently receive another ZK event saying that broker A has come back up?"
> > > > > Controller will not miss the BrokerChange event, and actually there will be two BrokerChange events fired in this case (one for broker deregistration in zk and one for registration). However, when processing the BrokerChangeEvent, the controller needs to do a read from zookeeper to get back the current brokers in the cluster, and if the bounced broker already joined the cluster by this time, the controller will not know this broker has been bounced because it sees no diff between zk and its in-memory cache. So basically both of the BrokerChange event processings become no-ops.
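
For illustration, a ZooKeeper read that also captures each broker's czxid (the broker generation) would make such a bounce visible even when the broker id set looks unchanged; a rough sketch against the plain ZooKeeper client API (not the controller's actual zkClient code):

    import org.apache.zookeeper.ZooKeeper
    import scala.jdk.CollectionConverters._

    object BrokerGenerationRead {

      // Returns broker id -> czxid (broker generation) for all live brokers.
      def currentBrokers(zk: ZooKeeper): Map[Int, Long] = {
        val ids = zk.getChildren("/brokers/ids", false).asScala
        ids.flatMap { id =>
          // The znode's Stat carries its czxid, which changes every time the broker
          // re-registers, because the ephemeral znode is deleted and recreated.
          Option(zk.exists(s"/brokers/ids/$id", false)).map(stat => id.toInt -> stat.getCzxid)
        }.toMap
      }
    }

With the czxid available, the "no diff" case above would still show a changed generation for the bounced broker.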
> > > > >
> > > > > Hope that answers your questions. Feel free to follow up if I am missing something.
> > > > >
> > > > > Thanks,
> > > > > Zhanxiang (Patrick) Huang
> > > > >
> > > > > ________________________________
> > > > > From: Stanislav Kozlovski <stanis...@confluent.io>
> > > > > Sent: Wednesday, October 10, 2018 7:22
> > > > > To: dev@kafka.apache.org
> > > > > Subject: Re: [DISCUSS] KIP-380: Detect outdated control requests and bounced brokers using broker generation
> > > > >
> > > > > Hi Patrick,
> > > > >
> > > > > Thanks for the KIP! Fixing such correctness issues is always very welcome - they're commonly hard to diagnose and debug when they happen in production.
> > > > >
> > > > > I was wondering if I understood the potential correctness issues correctly. Here is what I got:
> > > > >
> > > > >    - If a broker bounces during controlled shutdown, the bounced broker may accidentally process its earlier generation's StopReplicaRequest sent from the active controller for one of its follower replicas, leaving the replica offline while its remaining replicas may stay online
> > > > >
> > > > > broker A is initiating a controlled shutdown (restart). The Controller sends a StopReplicaRequest but it reaches broker A after it has started up again. He therefore stops replicating those partitions even though he should just be starting to.
> > > > >
> > > > >    - If the first LeaderAndIsrRequest that a broker processes is sent by the active controller before its startup, the broker will overwrite the high watermark checkpoint file and may cause incorrect truncation (KAFKA-7235 <https://issues.apache.org/jira/browse/KAFKA-7235>)
> > > > >
> > > > > Controller sends a LeaderAndIsrRequest before broker A initiates a restart. Broker A restarts and receives the LeaderAndIsrRequest then. It therefore starts leading for the partitions sent by that request and might stop leading partitions that it was leading previously. This was well explained in the linked JIRA, but I cannot understand why that would happen due to my limited experience. If Broker A leads p1 and p2, when would a Controller send a LeaderAndIsrRequest with p1 only and not want Broker A to drop leadership for p2?
> > > > >
> > > > >    - If a broker bounces very quickly, the controller may start processing the BrokerChange event after the broker already re-registers itself in zk. In this case, the controller will miss the broker restart and will not send any requests to the broker for initialization. The broker will not be able to accept traffic.
> > > > >
> > > > > Here the controller will start processing the BrokerChange event (that says that broker A shutdown) after the broker has come back up and re-registered himself in ZK? How will the Controller miss the restart, won't he subsequently receive another ZK event saying that broker A has come back up?
> > > > >
> > > > > Could we explain these potential problems in a bit more detail just so they could be more easily digestible by novices?
> > > > >
> > > > > Thanks,
> > > > > Stanislav
> > > > >
> > > > > On Wed, Oct 10, 2018 at 9:21 AM Dong Lin <lindon...@gmail.com> wrote:
> > > > > > Hey Patrick,
> > > > > >
> > > > > > Thanks much for the KIP. The KIP is very well written.
> > > > > >
> > > > > > LGTM. +1 (binding)
> > > > > >
> > > > > > Thanks,
> > > > > > Dong
> > > > > >
> > > > > > On Tue, Oct 9, 2018 at 11:46 PM Patrick Huang <hzx...@hotmail.com> wrote:
> > > > > > > Hi All,
> > > > > > >
> > > > > > > Please find the below KIP which proposes the concept of broker generation to resolve issues caused by the controller missing broker state changes and the broker processing outdated control requests.
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-380%3A+Detect+outdated+control+requests+and+bounced+brokers+using+broker+generation
> > > > > > >
> > > > > > > All comments are appreciated.
> > > > > > >
> > > > > > > Best,
> > > > > > > Zhanxiang (Patrick) Huang
> > > > >
> > > > > --
> > > > > Best,
> > > > > Stanislav