Thanks Colin for the detailed KIP. I have a few comments and questions.

In the KIP's Motivation and Overview you mention the LeaderAndIsr and UpdateMetadata RPCs. For example, "updates which the controller pushes, such as LeaderAndIsr and UpdateMetadata messages". Is your thinking that we will use MetadataFetch as a replacement for UpdateMetadata only, and add topic configuration to this state?
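Purely as an illustration of what I have in mind here (none of these names or fields are from the KIP, they are just my guesses), I am imagining the exchange looking roughly like this:

    import java.util.List;

    public final class MetadataFetchSketch {

        // Broker -> controller: "send me every metadata record after this offset".
        record MetadataFetchRequest(int brokerId, long brokerEpoch, long lastFetchedOffset) { }

        // Controller -> broker: deltas (or a full image) plus the current high watermark.
        record MetadataFetchResponse(long highWatermark, List<MetadataRecord> records) { }

        // A record could describe leadership/ISR changes (LeaderAndIsr today),
        // cluster membership (UpdateMetadata today), or topic configuration.
        sealed interface MetadataRecord permits PartitionRecord, BrokerRecord, ConfigRecord { }
        record PartitionRecord(String topic, int partition, int leader, List<Integer> isr)
                implements MetadataRecord { }
        record BrokerRecord(int brokerId, String host, int port) implements MetadataRecord { }
        record ConfigRecord(String topic, String key, String value) implements MetadataRecord { }
    }

If that is the direction, a single record stream could cover what LeaderAndIsr, UpdateMetadata, and the topic configs in ZK carry today, which is what prompted the question.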
In the section "Broker Metadata Management", you mention "Just like with a fetch request, the broker will track the offset of the last updates it fetched". To keep the log consistent, Raft requires that the followers keep all of the log entries (term/epoch and offset) that are after the high watermark; any log entry before the high watermark can be compacted/snapshotted. Do we expect the MetadataFetch API to only return log entries up to the high watermark, unlike the Raft replication API, which replicates/fetches log entries after the high watermark for consensus?

In the section "Broker Metadata Management", you mention "the controller will send a full metadata image rather than a series of deltas". This KIP doesn't go into the set of operations that need to be supported on top of Raft, but it would be interesting if this "full metadata image" could also be expressed as deltas. For example, assuming we are replicating a map, the "full metadata image" could be a sequence of "put" operations (a znode create, to borrow ZK semantics); see the rough sketch at the end of this email.

In the section "Broker Metadata Management", you mention "This request will double as a heartbeat, letting the controller know that the broker is alive". In the section "Broker State Machine", you mention "The MetadataFetch API serves as this registration mechanism". Does this mean that the MetadataFetch request will optionally include broker configuration information? Does this also mean that a MetadataFetch request will result in a "write"/AppendEntries through the Raft replication protocol before you can send the associated MetadataFetch response?

In the section "Broker State", you mention that a broker can transition to online after it is caught up with the metadata. What do you mean by this? Metadata is always changing. How does the broker know that it is caught up, since it doesn't participate in the consensus or in the advancement of the high watermark? (See the second sketch at the end of this email for the kind of check I have in mind.)

In the section "Start the controller quorum nodes", you mention "Once it has taken over the /controller node, the active controller will proceed to load the full state of ZooKeeper. It will write out this information to the quorum's metadata storage. After this point, the metadata quorum will be the metadata store of record, rather than the data in ZooKeeper." During this migration, should we expect a small period of controller unavailability while the controller replicates this state to all of the Raft nodes in the controller quorum, with new controller API requests being buffered in the meantime?

Thanks!
-Jose
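P.S. To make a couple of the questions above more concrete, here is roughly what I mean by expressing the "full metadata image" as a sequence of deltas/put operations. All of the names below are invented for the sake of the example, they are not from the KIP:

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public final class FullImageAsDeltas {

        // A single replicated operation, analogous to a znode create/set in ZK.
        record Put(String path, byte[] value) { }

        // A "full metadata image" is then just the sequence of Puts that, applied
        // in order to an empty map, reproduces the current state.
        static List<Put> toDeltas(Map<String, byte[]> fullImage) {
            return fullImage.entrySet().stream()
                    .map(e -> new Put(e.getKey(), e.getValue()))
                    .toList();
        }

        // Brokers apply the same operation type whether the records came from a
        // snapshot or from incremental fetches.
        static Map<String, byte[]> apply(List<Put> deltas) {
            Map<String, byte[]> state = new LinkedHashMap<>();
            for (Put put : deltas) {
                state.put(put.path(), put.value());
            }
            return state;
        }
    }

The nice property would be that the broker's apply path is uniform: it doesn't need to distinguish a snapshot from a series of incremental updates.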
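And for the "caught up" question in the Broker State section, this is the kind of check I am wondering about. It would only work if the controller reports its high watermark in the MetadataFetch response (again, invented names):

    public final class CatchUpCheck {

        // What the broker learned from its latest MetadataFetch: the controller's
        // high watermark and the offset of the last record returned to the broker.
        record FetchResult(long highWatermark, long lastReturnedOffset) { }

        // Metadata keeps moving, so "caught up" can only mean "within some lag of
        // a recently reported high watermark", not "has applied everything ever".
        static boolean caughtUp(FetchResult result, long maxAllowedLag) {
            return result.highWatermark() - result.lastReturnedOffset() <= maxAllowedLag;
        }
    }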