Thanks for the KIP Jason, +1 (binding) from me as well for sure :)

On Tue, Aug 4, 2020 at 2:46 PM Colin McCabe <cmcc...@apache.org> wrote:

> On Mon, Aug 3, 2020, at 20:55, Jason Gustafson wrote:
> > Hi Colin,
> >
> > Thanks for the responses.
> >
> > > I have a few lingering questions.  I still don't like the fact that
> > > the leader epoch / fetch epoch is 31 bits.  What happens when this
> > > rolls over?  Can we just make this 63 bits now so that we never have
> > > to worry about it again?  ZK has some awful bugs surrounding 32-bit
> > > rollover, due to a similar decision to use a 32-bit counter in their
> > > log structure.  Doesn't seem like a good tradeoff.
> >
> > This is a bit difficult to do at the moment since the leader epoch is
> > 4 bytes in the message format. One option that I have considered is
> > toggling a batch attribute that lets us turn the producerId into an
> > 8-byte leader epoch instead, since we do not have a use for it in the
> > metadata quorum. We would need another solution if we ever wanted to
> > use Raft for partition replication, but perhaps by then we can make the
> > case for a new message format.
> >
>
> Hi Jason,
>
> Thanks for the explanation.  I suspected that there was a technical
> limitation like this lurking somewhere.  I think a hack like the one you
> suggested would be OK for now.  I just really want to avoid thinking about
> rollover :)
>
Regarding the epoch overflow, offline discussions among Jason, Guozhang,
Jose, and me reached the following conclusions:

1. The current default election timeout is 10 seconds, which means it would
take hundreds of years to exhaust a 31-bit epoch by bumping it once per
election timeout. Even if the user sets the timeout to 1 second, exhaustion
would still take decades (rough arithmetic below).
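
For concreteness, assuming one epoch bump per election timeout and a signed
32-bit (i.e. 31-bit usable) epoch:

    2^31 epochs * 10 s/epoch ~= 2.1 * 10^10 s ~= 680 years
    2^31 epochs *  1 s/epoch ~= 2.1 * 10^9 s  ~=  68 years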

2. The most common cause of rapid epoch bumps is a network partition. If a
voter cannot connect to the quorum, it will repeatedly start elections,
bumping the epoch each time. To mitigate this, we have already planned a
follow-up KIP to add the `pre-vote` feature described in the Raft literature
to the Kafka Raft implementation, which prevents rapid epoch increments at
the algorithm level (see the sketch below).
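
To give a rough idea of how pre-vote prevents that churn, here is an
illustrative Java sketch; the method and helper names are hypothetical, not
the actual Kafka API:

    // Illustrative sketch of the pre-vote idea from the Raft literature.
    // sendPreVoteRequests() and majoritySize() are hypothetical helpers.
    boolean shouldStartElection() {
        long prospectiveEpoch = currentEpoch + 1;
        // Ask peers whether they *would* grant a vote at prospectiveEpoch.
        // Peers answer without bumping their own epoch or recording a vote.
        int granted = sendPreVoteRequests(prospectiveEpoch);
        // Bump the real epoch only once a majority signals support, so a
        // partitioned voter cannot drive unbounded epoch increments.
        return granted >= majoritySize();
    }

A partitioned voter keeps failing the pre-vote round and never increments
its epoch, so rejoining the quorum later does not disturb a stable leader.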

3. As you suggested, leader epoch overflow is a general problem, not one
specific to Raft. We could kick off a separate KIP to change the epoch from
4 bytes to 8 bytes through a message format upgrade, solving the issue for
Kafka in a holistic manner.
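
In the meantime, the batch-attribute trick Jason mentioned above could look
roughly like the following Java sketch (purely illustrative; assumes
java.nio.ByteBuffer, and the attribute flag, offsets, and names are all
hypothetical, not the real message format constants):

    // When a (hypothetical) attribute bit is set, reinterpret the unused
    // 8-byte producerId slot of the record batch as a 64-bit leader epoch,
    // since the metadata quorum has no use for producerId.
    long effectiveLeaderEpoch(ByteBuffer batch) {
        short attributes = batch.getShort(ATTRIBUTES_OFFSET);
        if ((attributes & WIDE_EPOCH_FLAG) != 0) {
            return batch.getLong(PRODUCER_ID_OFFSET);        // 64-bit epoch
        }
        return batch.getInt(PARTITION_LEADER_EPOCH_OFFSET);  // legacy 32 bits
    }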



> >
> > > Just like in bootstrap.servers, I don't think we want to manually
> > > assign IDs per hostname.  The hosts know their own IDs, after all.
> > > Having to manually specify the IDs also opens up the possibility of
> > > misconfigurations: what if I say the foobar server is node 2, but it's
> > > actually node 3?  This would make the logs extremely confusing.  I
> > > realize this may require a little finesse to do, but there's got to be
> > > a way we can avoid hard-coding IDs.
> >
> > Fine. We can move this to KIP-631, but I think it would be a mistake to
> > take IDs out of this configuration. For safety, the one thing that the
> > configuration needs to tell us is what the IDs of the voters are. Without
> > that, it's really easy for a cluster to get into a state where none of
> > the quorum members agree on what the proper set of voters is. I think
> > perhaps you are confused about the usage of these IDs. They are what
> > enable validation of voter requests. Without them, a voter would have to
> > accept a vote request from any ID. There is a reason that other consensus
> > systems like ZooKeeper and etcd require IDs when configured statically.
> >
>
> I hadn't considered the fact that we need to validate incoming voter
> requests.  The fact that nodes can have multiple DNS addresses does make
> this difficult to do with just a list of hostnames.
>
> I guess you're right that we should keep the IDs.  But let's be careful to
> validate that the node's ID really is what we think it is, and consider
> that peer failed if it's not.
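
(For reference, the static voter configuration being discussed looks
roughly like this in KIP-595; the exact property name could still change in
follow-up KIPs:

    quorum.voters=1@host1:9092,2@host2:9092,3@host3:9092

Each entry pins an expected voter ID to an endpoint, which is what allows a
voter to reject a vote request coming from an unexpected ID.)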
>
> >
> > > Also, here's another case where we are saying "broker" when we mean
> > > "controller."  It's really hard to break old habits.  :)
> >
> > I think we still have this basic disagreement on the KIP-500 vision :).
> > I'm not sure I understand why you are so eager to force users to think
> > about the controller as a separate system. It's almost like Zookeeper is
> > not going anywhere!
> >
>
> Well, KIP-500 clearly does identify the controller as a separate system,
> not as part of the broker, even if it runs in the same JVM.  :) A system
> where all the nodes had the same role would need a fundamentally different
> design, like Cassandra or something.
>
> I know you're joking, but just so that others understand, it's not fair
> to say that "it's almost like ZK is not going anywhere."  KIP-500 clusters
> will have simpler deployment and support a lot of interesting use-cases,
> like single-JVM clusters, that would not be possible with the current
> setup.
>
> At the same time, saying "broker" when you mean "controller" confuses
> people.  For example, I had someone ask a question recently about why we
> needed BrokerHeartbeat when Raft already specifies a mechanism for leader
> change.  I had to explain the difference between broker nodes and
> controller nodes.
>
> Anyway, +1 (binding).  Excited to see Raftka going forward!
>
> best,
> Colin
>
> >
> > -Jason
> >
> >
> >
> >
> > On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio <jsan...@confluent.io>
> > wrote:
> >
> > > +1.
> > >
> > > Thanks for the detailed KIP!
> > >
> > > On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > > >
> > > > Hi All, I'd like to start a vote on this proposal:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > > >
> > > > The discussion has been active for a bit more than 3 months and I
> > > > think the main points have been addressed. We have also moved some
> > > > of the pieces into follow-up proposals, such as KIP-630.
> > > >
> > > > Please keep in mind that the details are bound to change as all of
> > > > the pieces start coming together. As usual, we will keep this thread
> > > > notified of such changes.
> > > >
> > > > For me personally, this is super exciting since we have been
> > > > thinking about this work ever since I started working on Kafka! I am
> > > > +1 of course.
> > > >
> > > > Best,
> > > > Jason
> > >
> > >
> > >
> > > --
> > > -Jose
> > >
> >
>
