Thanks for explaining, Ismael! Breaking this down into follow-up KIPs sounds like a good idea.
On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Boyang,
>
> Yes, there will be several KIPs that will discuss the items you describe
> in detail. Colin, it may be helpful to make this clear in the KIP-500
> description.
>
> Ismael
>
> On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
>
> > Thanks Colin for initiating this important effort!
> >
> > One question I have is whether we have a session discussing the
> > controller failover in the new architecture? I know we are using the
> > Raft protocol to fail over, yet it's still valuable to discuss the steps
> > a new cluster is going to take to reach a stable state again, so that we
> > could easily measure the availability of the metadata servers.
> >
> > Another suggestion I have is to write a step-by-step design doc like
> > what we did in KIP-98
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging>,
> > including the new request protocols and how they interact in the new
> > cluster. For a complicated change like this, an implementation design
> > doc helps a lot in the review process; otherwise most discussions we
> > have will focus on the high level and lose important details as we
> > discover them in the post-agreement phase.
> >
> > Boyang
> >
> > On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> > > > Thanks Colin for the detailed KIP. I have a few comments and
> > > > questions.
> > > >
> > > > In the KIP's Motivation and Overview you mentioned the LeaderAndIsr
> > > > and UpdateMetadata RPCs. For example, "updates which the controller
> > > > pushes, such as LeaderAndIsr and UpdateMetadata messages". Is your
> > > > thinking that we will use MetadataFetch as a replacement for just
> > > > UpdateMetadata, and add topic configuration in this state?
> > >
> > > Hi Jose,
> > >
> > > Thanks for taking a look.
> > >
> > > The goal is for MetadataFetchRequest to replace both
> > > LeaderAndIsrRequest and UpdateMetadataRequest. Topic configurations
> > > would be fetched along with the other metadata.
> > >
> > > > In the section "Broker Metadata Management", you mention "Just like
> > > > with a fetch request, the broker will track the offset of the last
> > > > updates it fetched". To keep the log consistent, Raft requires that
> > > > the followers keep all of the log entries (term/epoch and offset)
> > > > that are after the high watermark; any log entry before the high
> > > > watermark can be compacted/snapshotted. Do we expect the
> > > > MetadataFetch API to only return log entries up to the high
> > > > watermark, unlike the Raft replication API, which will
> > > > replicate/fetch log entries after the high watermark for consensus?
> > >
> > > Good question. Clearly, we shouldn't expose metadata updates to the
> > > brokers until they've been stored on a majority of the Raft nodes. The
> > > most obvious way to do that, like you mentioned, is to have the
> > > brokers only fetch up to the HWM, but not beyond. There might be a
> > > more clever way to do it by fetching the data, but not having the
> > > brokers act on it until the HWM advances. I'm not sure if that's worth
> > > it or not. We'll discuss this more in a separate KIP that discusses
> > > just Raft.
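
To make the high-watermark point above a bit more concrete, here is a rough
sketch of a broker-side fetcher that refuses to act on anything beyond the
Raft HWM. All class and method names below are made up for illustration;
nothing here is from the KIP or the existing codebase.

    import java.util.List;

    // Hypothetical sketch only: none of these types exist in Kafka. The idea
    // is "fetch metadata, but act on nothing beyond the Raft high watermark".
    public final class BrokerMetadataFetcher {

        /** A single metadata record plus the log offset it was written at. */
        public record MetadataRecord(long offset, byte[] payload) {}

        /** Response to a hypothetical MetadataFetch RPC. */
        public record MetadataFetchResponse(long highWatermark,
                                            List<MetadataRecord> records) {}

        /** Whatever in-memory view of the metadata the broker maintains. */
        public interface MetadataImage {
            void apply(MetadataRecord record);
        }

        private long lastAppliedOffset = -1L;

        /** Apply only records that a majority of the controller quorum has
         *  durably stored. */
        public void handleResponse(MetadataFetchResponse response, MetadataImage image) {
            for (MetadataRecord record : response.records()) {
                // Records past the high watermark could still be truncated by
                // a new Raft leader, so the broker must not act on them yet.
                if (record.offset() > response.highWatermark()) {
                    break;
                }
                image.apply(record);
                lastAppliedOffset = record.offset();
            }
        }

        /** The next MetadataFetch resumes just after the last applied offset. */
        public long nextFetchOffset() {
            return lastAppliedOffset + 1;
        }
    }

Whether the brokers avoid fetching past the HWM entirely, or fetch the data
and withhold it until the HWM advances, is exactly the trade-off Colin defers
to the separate Raft KIP.
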
> > > > In section "Broker Metadata Management", you mention "the controller
> > > > will send a full metadata image rather than a series of deltas". This
> > > > KIP doesn't go into the set of operations that need to be supported
> > > > on top of Raft, but it would be interesting if this "full metadata
> > > > image" could also be expressed as deltas. For example, assuming we
> > > > are replicating a map, this "full metadata image" could be a sequence
> > > > of "put" operations (znode create, to borrow ZK semantics).
> > >
> > > The full image can definitely be expressed as a sum of deltas. At some
> > > point, the number of deltas will get large enough that sending a full
> > > image is better, though. One question that we're still thinking about
> > > is how much of this can be shared with generic Kafka log code, and how
> > > much should be different.
> > >
> > > > In section "Broker Metadata Management", you mention "This request
> > > > will double as a heartbeat, letting the controller know that the
> > > > broker is alive". In section "Broker State Machine", you mention "The
> > > > MetadataFetch API serves as this registration mechanism". Does this
> > > > mean that the MetadataFetch request will optionally include broker
> > > > configuration information?
> > >
> > > I was originally thinking that the MetadataFetchRequest should include
> > > broker configuration information. Thinking about this more, maybe we
> > > should just have a special registration RPC that contains that
> > > information, to avoid sending it over the wire all the time.
> > >
> > > > Does this also mean that a MetadataFetch request will result in a
> > > > "write"/AppendEntries through the Raft replication protocol before
> > > > you can send the associated MetadataFetch response?
> > >
> > > I think we should require the broker to be out of the Offline state
> > > before allowing it to fetch metadata, yes. So the separate registration
> > > RPC should have completed first.
> > >
> > > > In section "Broker State", you mention that a broker can transition
> > > > to online after it is caught up with the metadata. What do you mean
> > > > by this? Metadata is always changing. How does the broker know that
> > > > it is caught up, since it doesn't participate in the consensus or the
> > > > advancement of the high watermark?
> > >
> > > That's a good point. Being "caught up" is somewhat of a fuzzy concept
> > > here, since the brokers do not participate in the metadata consensus.
> > > I think ideally we would want to define it in terms of time ("the
> > > broker has all the updates from the last 2 minutes", for example). We
> > > should spell this out better in the KIP.
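
The time-based definition of "caught up" could be as simple as the sketch
below (again, hypothetical names only): the controller stamps each committed
metadata batch with its commit time, and the broker compares that timestamp
against a configured maximum lag before considering itself caught up.

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical sketch of a time-based "caught up" check, as in "the broker
    // has all the updates from the last 2 minutes". Not an actual Kafka API.
    public final class MetadataCatchUpCheck {

        private final Duration maxLag;
        private volatile Instant lastCommitTime = Instant.EPOCH;

        public MetadataCatchUpCheck(Duration maxLag) {
            this.maxLag = maxLag;
        }

        /** Called whenever a fetched metadata batch is applied; the controller
         *  would stamp each batch with the time it was committed. */
        public void onBatchApplied(Instant controllerCommitTime) {
            lastCommitTime = controllerCommitTime;
        }

        /** A broker would only report itself Online while this holds. */
        public boolean isCaughtUp(Instant now) {
            return Duration.between(lastCommitTime, now).compareTo(maxLag) <= 0;
        }
    }

A broker configured with, say, new MetadataCatchUpCheck(Duration.ofMinutes(2))
would match the "all the updates from the last 2 minutes" example above, and
could fall back out of Online if its fetches stall for longer than that.
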
> > > > In section "Start the controller quorum nodes", you mention "Once it
> > > > has taken over the /controller node, the active controller will
> > > > proceed to load the full state of ZooKeeper. It will write out this
> > > > information to the quorum's metadata storage. After this point, the
> > > > metadata quorum will be the metadata store of record, rather than the
> > > > data in ZooKeeper." During this migration, should we expect a small
> > > > period of controller unavailability while the controller replicates
> > > > this state to all of the Raft nodes in the controller quorum and we
> > > > buffer new controller API requests?
> > >
> > > Yes, the controller would be unavailable during this time. I don't
> > > think this will be that different from the current period of
> > > unavailability when a new controller starts up and needs to load the
> > > full state from ZK. The main difference is that in this period, we'd
> > > have to write to the controller quorum rather than just to memory. But
> > > we believe this should be pretty fast.
> > >
> > > regards,
> > > Colin
> > >
> > > > Thanks!
> > > > -Jose
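
For what it's worth, the one-time ZooKeeper-to-quorum copy that Colin
describes could look roughly like the sketch below. The ZooKeeper client calls
(getData/getChildren) are the real API; the MetadataQuorumLog interface is
invented here purely to stand in for "append a record to the controller
quorum's metadata storage".

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Rough sketch of the one-time ZK -> quorum copy. The ZooKeeper calls are
    // the real client API; MetadataQuorumLog is invented for illustration.
    public final class ZkMetadataMigrator {

        /** Stand-in for "append a record to the quorum's metadata storage". */
        public interface MetadataQuorumLog {
            void append(String path, byte[] value);
        }

        private final ZooKeeper zk;
        private final MetadataQuorumLog quorumLog;

        public ZkMetadataMigrator(ZooKeeper zk, MetadataQuorumLog quorumLog) {
            this.zk = zk;
            this.quorumLog = quorumLog;
        }

        /** Recursively copy a znode subtree into the quorum log. While this
         *  runs, incoming controller requests would have to be buffered or
         *  rejected -- the brief unavailability window discussed above. */
        public void copySubtree(String path) throws KeeperException, InterruptedException {
            byte[] data = zk.getData(path, false, null);
            quorumLog.append(path, data == null ? new byte[0] : data);
            List<String> children = zk.getChildren(path, false);
            for (String child : children) {
                String childPath = path.equals("/") ? "/" + child : path + "/" + child;
                copySubtree(childPath);
            }
        }
    }
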