Hi Patrick,

Thanks for the KIP! Fixing such correctness issues is always very welcome - they're commonly hard to diagnose and debug when they happen in production.

I was wondering if I understood the potential correctness issues correctly. Here is what I got:

- If a broker bounces during controlled shutdown, the bounced broker may accidentally process its earlier generation’s StopReplicaRequest sent from the active controller for one of its follower replicas, leaving the replica offline while its remaining replicas may stay online.

  Broker A initiates a controlled shutdown (restart). The controller sends a StopReplicaRequest, but it reaches broker A only after broker A has started up again. Broker A therefore stops replicating those partitions even though it should just be starting to.

- If the first LeaderAndIsrRequest that a broker processes is sent by the active controller before its startup, the broker will overwrite the high watermark checkpoint file and may cause incorrect truncation (KAFKA-7235 <https://issues.apache.org/jira/browse/KAFKA-7235>).

  The controller sends a LeaderAndIsrRequest before broker A initiates a restart. Broker A restarts and only then receives the LeaderAndIsrRequest. It therefore starts leading the partitions named in that request and might stop leading partitions that it was leading previously. This was well explained in the linked JIRA but, given my limited experience, I cannot understand why that would happen. If broker A leads p1 and p2, when would a controller send a LeaderAndIsrRequest with p1 only and not want broker A to drop leadership for p2?

- If a broker bounces very quickly, the controller may start processing the BrokerChange event after the broker already re-registers itself in ZK. In this case, the controller will miss the broker restart and will not send any requests to the broker for initialization. The broker will not be able to accept traffic.

  Here the controller starts processing the BrokerChange event (the one saying that broker A shut down) after the broker has come back up and re-registered itself in ZK? How will the controller miss the restart? Won't it subsequently receive another ZK event saying that broker A has come back up?

Could we explain these potential problems in a bit more detail, just so they are more easily digestible by novices?

Thanks,
Stanislav

On Wed, Oct 10, 2018 at 9:21 AM Dong Lin <lindon...@gmail.com> wrote:

> Hey Patrick,
>
> Thanks much for the KIP. The KIP is very well written.
>
> LGTM. +1 (binding)
>
> Thanks,
> Dong
>
>
> On Tue, Oct 9, 2018 at 11:46 PM Patrick Huang <hzx...@hotmail.com> wrote:
>
> > Hi All,
> >
> > Please find the below KIP which proposes the concept of broker generation
> > to resolve issues caused by controller missing broker state changes and
> > broker processing outdated control requests.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-380%3A+Detect+outdated+control+requests+and+bounced+brokers+using+broker+generation
> >
> > All comments are appreciated.
> >
> > Best,
> > Zhanxiang (Patrick) Huang
> >
>

--
Best,
Stanislav