Re: [VOTE] KIP-595: A Raft Protocol for the Metadata Quorum

Colin McCabe Tue, 04 Aug 2020 14:47:08 -0700

On Mon, Aug 3, 2020, at 20:55, Jason Gustafson wrote:
> Hi Colin,
> 
> Thanks for the responses.
> 
> > I have a few lingering questions.  I still don't like the fact that the
> > leader epoch / fetch epoch is 31 bits.  What happens when this rolls over?
> > Can we just make this 63 bits now so that we never have to worry about it
> > again?  ZK has some awful bugs surrounding 32 bit rollover, due to a
> > similar decision to use a 32 bit counter in their log structure.  Doesn't
> > seem like a good tradeoff.
> 
> This is a bit difficult to do at the moment since the leader epoch is 4
> bytes in the message format. One option that I have considered is toggling
> a batch attribute that lets us turn the producerId into an 8-byte leader
> epoch instead since we do not have a use for it in the metadata quorum. We
> would need another solution if we ever wanted to use Raft for partition
> replication, but perhaps by then we can make the case for a new message
> format.
>

Hi Jason,

Thanks for the explanation.  I suspected that there was a technical limitation 
like this lurking somewhere.  I think a hack like the one you suggested would 
be OK for now.  I just really want to avoid thinking about rollover :)
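
To make the rollover concern concrete, here is a minimal Python sketch (not Kafka code). It assumes the epoch lives in a signed 32-bit field, so only 31 bits are usable. The helper names and the bump rates are hypothetical, chosen only for illustration.

```python
import ctypes

# Largest non-negative values of a signed 32-bit and a signed 64-bit field.
MAX_EPOCH_31 = 2**31 - 1
MAX_EPOCH_63 = 2**63 - 1

SECONDS_PER_YEAR = 365 * 24 * 3600


def next_epoch_int32(epoch: int) -> int:
    """Increment an epoch the way a signed 32-bit field would, wrapping on overflow."""
    return ctypes.c_int32(epoch + 1).value


def years_to_exhaust(max_epoch: int, bumps_per_second: float) -> float:
    """Years until the counter reaches max_epoch at a steady (hypothetical) bump rate."""
    return max_epoch / bumps_per_second / SECONDS_PER_YEAR


# One more election after the maximum epoch yields a negative epoch,
# which would compare as older than every epoch that came before it.
wrapped = next_epoch_int32(MAX_EPOCH_31)
print(wrapped)
print(years_to_exhaust(MAX_EPOCH_31, 1.0))
print(years_to_exhaust(MAX_EPOCH_63, 1000.0))
```

Under these assumptions a 31-bit epoch only runs out after decades of continuous elections, but a wrapped value breaks ordering outright, whereas a 63-bit epoch stays out of reach even at a far higher bump rate.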

>
> > Just like in bootstrap.servers, I don't think we want to manually assign
> > IDs per hostname.  The hosts know their own IDs, after all.  Having to
> > manually specify the IDs also opens up the possibility of
> > misconfigurations: what if I say the foobar server is node 2, but it's
> > actually node 3? This would make the logs extremely confusing.  I realize
> > this may require a little finesse to do, but there's got to be a way we can
> > avoid hard-coding IDs.
> 
> Fine. We can move this to KIP-631, but I think it would be a mistake to
> take IDs out of this configuration. For safety, the one thing that the
> configuration needs to tell us is what the IDs of the voters are. Without
> that, it's really easy for a cluster to get into a state where none of
> the quorum members agree on what the proper set of voters is. I think
> perhaps you are confused on the usage of these IDs. It is what enables
> validation of voter requests. Without it, a voter would have to accept a
> vote request from any ID. There is a reason that other consensus systems
> like ZooKeeper and etcd require IDs when configured statically.
>

I hadn't considered the fact that we need to validate incoming voter requests.  
The fact that nodes can have multiple DNS addresses does make this difficult to 
do with just a list of hostnames.

I guess you're right that we should keep the IDs.  But let's be careful to 
validate that each node's ID really is what we think it is, and treat that 
peer as failed if it's not.
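
As a rough illustration of both checks, here is a small Python sketch (not Kafka code). The `id@host:port` voter string, the host names, and all function names are assumptions made up for this example.

```python
from typing import Dict


def parse_voters(spec: str) -> Dict[int, str]:
    """Parse a static voter list such as '1@a:9093,2@b:9093' into {id: endpoint}."""
    voters: Dict[int, str] = {}
    for entry in spec.split(","):
        node_id, _, endpoint = entry.strip().partition("@")
        voters[int(node_id)] = endpoint
    return voters


def accept_vote_request(voters: Dict[int, str], candidate_id: int) -> bool:
    """Only accept vote requests from IDs that are in the configured voter set."""
    return candidate_id in voters


def peer_id_matches(voters: Dict[int, str], endpoint: str, reported_id: int) -> bool:
    """Check that the ID a peer reports is the one configured for its endpoint."""
    expected = [node_id for node_id, ep in voters.items() if ep == endpoint]
    return expected == [reported_id]


voters = parse_voters("1@controller-a:9093, 2@foobar:9093, 3@controller-c:9093")

# The configuration says foobar is node 2. If foobar reports itself as node 3,
# the mismatch is detected and the peer can be treated as failed.
print(peer_id_matches(voters, "foobar:9093", 2))
print(peer_id_matches(voters, "foobar:9093", 3))
print(accept_vote_request(voters, 7))
```

The first check is what the configured IDs buy us: a voter can reject requests from IDs outside the set. The second is the extra validation I'm asking for, so that a misconfigured ID shows up as a failed peer rather than as confusing logs.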

>
> > Also, here's another case where we are saying "broker" when we mean
> > "controller."  It's really hard to break old habits.  :)
> 
> I think we still have this basic disagreement on the KIP-500 vision :). I'm
> not sure I understand why you are so eager to force users to think about
> the controller as a separate system. It's almost like Zookeeper is not
> going anywhere!
>

Well, KIP-500 clearly does identify the controller as a separate system, not as 
part of the broker, even if it runs in the same JVM.  :) A system where all the 
nodes had the same role would need a fundamentally different design, like 
Cassandra or something.

I know you're joking, but just so that others understand, it's not fair to say 
that "it's almost like ZK is not going anywhere."  KIP-500 clusters will have 
simpler deployment and support a lot of interesting use cases, like single-JVM 
clusters, that would not be possible with the current setup.

At the same time, saying "broker" when you mean "controller" confuses people.  
For example, I had someone ask a question recently about why we needed 
BrokerHeartbeat when Raft already specifies a mechanism for leader change.  I 
had to explain the difference between broker nodes and controller nodes.

Anyway, +1 (binding).  Excited to see Raftka going forward!

best,
Colin

>
> -Jason
> 
> 
> 
> 
> On Mon, Aug 3, 2020 at 4:36 PM Jose Garcia Sancio <jsan...@confluent.io>
> wrote:
> 
> > +1.
> >
> > Thanks for the detailed KIP!
> >
> > On Mon, Aug 3, 2020 at 11:03 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> > >
> > > Hi All, I'd like to start a vote on this proposal:
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A+A+Raft+Protocol+for+the+Metadata+Quorum
> > .
> > > The discussion has been active for a bit more than 3 months and I think
> > > the main points have been addressed. We have also moved some of the
> > > pieces into follow-up proposals, such as KIP-630.
> > >
> > > Please keep in mind that the details are bound to change as all of
> > > the pieces start coming together. As usual, we will keep this thread
> > > notified of such changes.
> > >
> > > For me personally, this is super exciting since we have been thinking
> > > about this work ever since I started working on Kafka! I am +1 of course.
> > >
> > > Best,
> > > Jason
> >
> >
> >
> > --
> > -Jose
> >
>
