On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> Thanks Colin for the detailed KIP. I have a few comments and questions.
> 
> In the KIP's Motivation and Overview you mentioned the LeaderAndIsr and
> UpdateMetadata RPC. For example, "updates which the controller pushes, such
> as LeaderAndIsr and UpdateMetadata messages". Is your thinking that we will
> use MetadataFetch as a replacement for UpdateMetadata only, and add
> topic configuration to this state?
> 

Hi Jose,

Thanks for taking a look.

The goal is for MetadataFetchRequest to replace both LeaderAndIsrRequest and 
UpdateMetadataRequest.  Topic configurations would be fetched along with the 
other metadata.
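
To make that concrete, here's a rough sketch of what a single MetadataFetch 
round trip might carry. All of the type and field names below are placeholders, 
not a proposal for the actual schema:

    import java.util.List;

    public final class MetadataFetchSketch {
        // Hypothetical request: the broker reports the last metadata offset
        // it has applied, and asks for everything after it.
        record MetadataFetchRequest(int brokerId, long lastAppliedOffset) {}

        // Hypothetical response: one stream of records covering what
        // LeaderAndIsr and UpdateMetadata carry today, plus topic configs.
        record MetadataRecord(String type, byte[] payload) {}  // e.g. "PartitionLeader", "TopicConfig"
        record MetadataFetchResponse(long nextOffset, List<MetadataRecord> records) {}
    }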

> In the section "Broker Metadata Management", you mention "Just like with a
> fetch request, the broker will track the offset of the last updates it
> fetched". To keep the log consistent Raft requires that the followers keep
> all of the log entries (term/epoch and offset) that are after the
> highwatermark. Any log entry before the highwatermark can be
> compacted/snapshot. Do we expect the MetadataFetch API to only return log
> entries up to the highwatermark?  Unlike the Raft replication API which
> will replicate/fetch log entries after the highwatermark for consensus?

Good question.  Clearly, we shouldn't expose metadata updates to the brokers 
until they've been stored on a majority of the Raft nodes.  The most obvious 
way to do that, like you mentioned, is to have the brokers only fetch up to the 
HWM, but not beyond.  There might be a more clever way to do it by fetching the 
data, but not having the brokers act on it until the HWM advances.  I'm not 
sure if that's worth it or not.  We'll discuss this more in a separate KIP that 
covers just Raft.
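
For what it's worth, on the broker side the "fetch but don't act" variant could 
look something like the sketch below. This is just to illustrate the idea; all 
the names are made up:

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public final class DeferredApplySketch {
        // Records fetched past the high watermark, keyed by offset, held
        // back until we learn they are committed.
        private final NavigableMap<Long, byte[]> pending = new TreeMap<>();

        void onFetched(long offset, byte[] record) {
            pending.put(offset, record);
        }

        // When the controller reports a new high watermark, apply every
        // buffered record at or below it and drop it from the buffer.
        void onHighWatermarkAdvance(long highWatermark) {
            NavigableMap<Long, byte[]> committed = pending.headMap(highWatermark, true);
            for (Map.Entry<Long, byte[]> entry : committed.entrySet()) {
                applyToMetadataCache(entry.getKey(), entry.getValue());
            }
            committed.clear();
        }

        private void applyToMetadataCache(long offset, byte[] record) {
            // Update the broker's local metadata state here.
        }
    }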

> 
> In section "Broker Metadata Management", you mention "the controller will
> send a full metadata image rather than a series of deltas". This KIP
> doesn't go into the set of operations that need to be supported on top of
> Raft, but it would be interesting if this "full metadata image" could also
> be expressed as deltas. For example, assuming we are replicating a map, this
> "full metadata image" could be a sequence of "put" operations (znode creates,
> to borrow ZK semantics).

The full image can definitely be expressed as a sum of deltas.  At some point, 
the number of deltas will get large enough that sending a full image is better, 
though.  One question that we're still thinking about is how much of this can 
be shared with generic Kafka log code, and how much should be different.
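
To sketch the decision (the names here are made up, and the right cutoff is 
something we'd have to measure):

    public final class FullImageOrDeltasSketch {
        // If the records a broker needs were already compacted away, or if
        // replaying them would cost more than a snapshot, send the full image.
        static boolean shouldSendFullImage(long brokerOffset, long logStartOffset,
                                           long logEndOffset, long maxDeltasToReplay) {
            if (brokerOffset < logStartOffset) {
                return true;  // required deltas no longer exist in the log
            }
            return logEndOffset - brokerOffset > maxDeltasToReplay;
        }
    }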

> 
> In section "Broker Metadata Management", you mention "This request will
> double as a heartbeat, letting the controller know that the broker is
> alive". In section "Broker State Machine", you mention "The MetadataFetch
> API serves as this registration mechanism". Does this mean that the
> MetadataFetch Request will optionally include broker configuration
> information?

I was originally thinking that the MetadataFetchRequest should include broker 
configuration information.  Thinking about this more, maybe we should just have 
a special registration RPC that contains that information, to avoid sending it 
over the wire all the time.
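
Something like the following, where every name is hypothetical; the point is 
just that the static broker information goes over the wire once, rather than 
on every heartbeat:

    import java.util.List;
    import java.util.Map;

    public final class BrokerRegistrationSketch {
        record Listener(String name, String host, int port) {}

        // Hypothetical one-time registration RPC carrying the broker
        // information that we wouldn't want on every MetadataFetch.
        record BrokerRegistrationRequest(int brokerId, String rack,
                                         List<Listener> listeners,
                                         Map<String, String> configs) {}

        record BrokerRegistrationResponse(short errorCode, long brokerEpoch) {}
    }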

> Does this also mean that the MetadataFetch request will result in
> a "write"/AppendEntries through the Raft replication protocol before you
> can send the associated MetadataFetch Response?

I think we should require the broker to be out of the Offline state before 
allowing it to fetch metadata, yes.  So the separate registration RPC should 
have completed first.
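
On the controller side, the gate could be as simple as this sketch (the state 
names are placeholders):

    public final class MetadataFetchGateSketch {
        enum BrokerState { OFFLINE, FENCED, ONLINE }

        // A fetch from a broker that never completed registration is
        // rejected; otherwise we serve it and treat it as a heartbeat.
        static boolean mayFetchMetadata(BrokerState state) {
            return state != BrokerState.OFFLINE;
        }
    }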

> 
> In section "Broker State", you mention that a broker can transition to
> online after it is caught up with the metadata. What do you mean by this?
> Metadata is always changing. How does the broker know that it is caught up
> since it doesn't participate in the consensus or the advancement of the
> highwatermark?

That's a good point.  Being "caught up" is somewhat of a fuzzy concept here, 
since the brokers do not participate in the metadata consensus.  I think 
ideally we would want to define it in terms of time ("the broker has all the 
updates from the last 2 minutes," for example).  We should spell this out 
better in the KIP.
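
Roughly, a time-based definition might boil down to a check like this (the 
two-minute window is just the example above):

    public final class CaughtUpSketch {
        static final long MAX_LAG_MS = 2 * 60 * 1000;  // the "2 minutes" from above

        // A broker is "caught up" if the newest metadata record it has
        // fetched was appended within the allowed lag window.
        static boolean isCaughtUp(long newestFetchedRecordTimeMs, long nowMs) {
            return nowMs - newestFetchedRecordTimeMs <= MAX_LAG_MS;
        }
    }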

> 
> In section "Start the controller quorum nodes", you mention "Once it has
> taken over the /controller node, the active controller will proceed to load
> the full state of ZooKeeper.  It will write out this information to the
> quorum's metadata storage.  After this point, the metadata quorum will be
> the metadata store of record, rather than the data in ZooKeeper." During
> this migration, should we expect a small period of controller
> unavailability while the controller replicates this state to all of the
> Raft nodes in the controller quorum and we buffer new controller API
> requests?

Yes, the controller would be unavailable during this time.  I don't think this 
will be that different from the current period of unavailability when a new 
controller starts up and needs to load the full state from ZK.  The main 
difference is that in this period, we'd have to write to the controller quorum 
rather than just to memory.  But we believe this should be pretty fast.
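
In outline, the takeover sequence would be something like this (all the method 
names are placeholders):

    public final class ZkMigrationSketch {
        void becomeActiveController() {
            claimControllerZNode();                          // win the /controller election
            byte[] fullState = loadFullStateFromZooKeeper(); // same cost a new controller pays today
            writeToMetadataQuorum(fullState);                // the new step: persist into the Raft log
            beginServingControllerRequests();                // buffered requests drain here
        }

        private void claimControllerZNode() {}
        private byte[] loadFullStateFromZooKeeper() { return new byte[0]; }
        private void writeToMetadataQuorum(byte[] state) {}
        private void beginServingControllerRequests() {}
    }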

regards,
Colin

> 
> Thanks!
> -Jose
>
