Hi Patrick,

Thanks for the KIP! Fixes for correctness issues like these are always very
welcome; they are often hard to diagnose and debug when they happen in
production.

I want to check that I understood the potential correctness issues correctly.
Here is my reading:


   - If a broker bounces during controlled shutdown, the bounced broker may
   accidentally process its earlier generation’s StopReplicaRequest sent from
   the active controller for one of its follower replicas, leaving the replica
   offline while its remaining replicas may stay online

Broker A initiates a controlled shutdown (restart). The controller sends a
StopReplicaRequest, but it reaches broker A only after the broker has
started up again. Broker A therefore stops replicating those partitions even
though it should just be starting to replicate them.
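
To make my reading concrete, here is a minimal sketch of how a broker
generation could let the restarted broker reject the delayed request. All
names here (Broker, broker_epoch, the STALE_BROKER_EPOCH string) are
illustrative assumptions of mine, not code taken from the KIP:

```python
from dataclasses import dataclass, field


@dataclass
class StopReplicaRequest:
    broker_epoch: int   # generation of the broker the controller was targeting
    partitions: list    # partitions whose replicas should be stopped


@dataclass
class Broker:
    broker_epoch: int                         # generation assigned at this startup
    online: set = field(default_factory=set)  # replicas currently being served

    def handle_stop_replica(self, request: StopReplicaRequest) -> str:
        # A request stamped with an older generation was meant for the
        # previous incarnation of this broker, so it is ignored.
        if request.broker_epoch < self.broker_epoch:
            return "STALE_BROKER_EPOCH"
        self.online -= set(request.partitions)
        return "NONE"


# Broker A has restarted: its generation moved from 1 to 2 (assumed values).
broker_a = Broker(broker_epoch=2, online={"p1", "p2"})

# The StopReplicaRequest sent during the controlled shutdown arrives late.
delayed = StopReplicaRequest(broker_epoch=1, partitions=["p1"])
result = broker_a.handle_stop_replica(delayed)
```

Without the generation check, the delayed request would remove p1 from the
online set, which is the offline-replica problem described in the bullet.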


   - If the first LeaderAndIsrRequest that a broker processes is sent by
   the active controller before its startup, the broker will overwrite the
   high watermark checkpoint file and may cause incorrect truncation (
   KAFKA-7235 <https://issues.apache.org/jira/browse/KAFKA-7235>)

The controller sends a LeaderAndIsrRequest before broker A initiates a
restart. Broker A restarts and only then receives the LeaderAndIsrRequest.
It therefore starts leading the partitions named in that request and might
stop leading partitions that it was leading previously.
This was well explained in the linked JIRA, but with my limited experience I
cannot understand why that would happen. If broker A leads p1 and p2, when
would the controller send a LeaderAndIsrRequest with p1 only and not want
broker A to drop leadership for p2?


   - If a broker bounces very quickly, the controller may start processing
   the BrokerChange event after the broker already re-registers itself in zk.
   In this case, controller will miss the broker restart and will not send any
   requests to the broker for initialization. The broker will not be able to
   accept traffics.

Here the controller starts processing the BrokerChange event (the one saying
that broker A shut down) only after the broker has come back up and
re-registered itself in ZK?
How will the controller miss the restart? Won't it subsequently receive
another ZK event saying that broker A has come back up?
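
One way I can picture this race, assuming the controller handles a
BrokerChange event by re-reading the currently registered broker ids and
comparing them with its cached set (that is my assumption, and the function
names and generation values below are made up for illustration):

```python
def diff_by_id(cached, current):
    """Broker ids that look newly started when only ids are compared."""
    return set(current) - set(cached)


def diff_by_generation(cached, current):
    """Broker ids that are new or whose generation changed (i.e. bounced)."""
    return {bid for bid, gen in current.items() if cached.get(bid) != gen}


# Controller's cached view before the bounce: broker id -> generation.
cached = {1: 10, 2: 11}

# Broker 1 went down and re-registered before the controller got around to
# reading the registry, so the id set is unchanged but the generation is new.
current = {1: 12, 2: 11}

missed = diff_by_id(cached, current)            # empty: the bounce is invisible
detected = diff_by_generation(cached, current)  # broker 1 is flagged
```

If that picture is right, comparing ids alone shows no difference, so no
initialization requests would be sent, while comparing generations exposes
the restart.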


Could we explain these potential problems in a bit more detail, so that they
are easier for novices to digest?

Thanks,
Stanislav

On Wed, Oct 10, 2018 at 9:21 AM Dong Lin <lindon...@gmail.com> wrote:

> Hey Patrick,
>
> Thanks much for the KIP. The KIP is very well written.
>
> LGTM.  +1 (binding)
>
> Thanks,
> Dong
>
>
> On Tue, Oct 9, 2018 at 11:46 PM Patrick Huang <hzx...@hotmail.com> wrote:
>
> > Hi All,
> >
> > Please find the below KIP which proposes the concept of broker generation
> > to resolve issues caused by controller missing broker state changes and
> > broker processing outdated control requests.
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-380%3A+Detect+outdated+control+requests+and+bounced+brokers+using+broker+generation
> >
> > All comments are appreciated.
> >
> > Best,
> > Zhanxiang (Patrick) Huang
> >
>


-- 
Best,
Stanislav
