Hi all,

The KIP has been out for a while, so I'm thinking about calling a vote some time this week.
best,
Colin

On Mon, Aug 19, 2019, at 15:52, Colin McCabe wrote:
> On Mon, Aug 19, 2019, at 12:52, David Arthur wrote:
> > Thanks for the KIP, Colin. This looks great!
> >
> > I really like the idea of separating the Controller and Broker JVMs.
> >
> > As you alluded to above, it might be nice to have a separate broker-registration API to avoid overloading the metadata fetch API.
> >
>
> Hi David,
>
> Thanks for taking a look.
>
> I removed the sentence about MetadataFetch also serving as the broker registration API. I think I agree that we will probably want a separate RPC to fill this role. We will have a follow-on KIP that will go into more detail about metadata propagation and registration in the post-ZK world. That KIP will also have a full description of the registration RPC, etc. For now, I think the important part for KIP-500 is that the broker registers with the controller quorum. On registration, the controller quorum assigns it a new broker epoch, which can distinguish successive broker incarnations.
>
> > When a broker gets a metadata delta, will it be a sequence of deltas since the last update or a cumulative delta since the last update?
>
> It will be a sequence of deltas. Basically, the broker will be reading from the metadata log.
>
> > Will we include any kind of integrity check on the deltas to ensure the brokers have applied them correctly? Perhaps this will be addressed in one of the follow-on KIPs.
>
> In general, we will have checksums on the metadata that we fetch. This is similar to how we have checksums on regular data. Or if the question is about catching logic errors in the metadata handling code, that sounds more like something that should be caught by test cases.
>
> best,
> Colin
>
> > Thanks!
> >
> > On Fri, Aug 9, 2019 at 1:17 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Mickael,
> > >
> > > Thanks for taking a look.
> > >
> > > I don't think we want to support that kind of multi-tenancy at the controller level. If the cluster is small enough that we want to pack the controller(s) with something else, we could run them alongside the brokers, or possibly inside three of the broker JVMs.
> > >
> > > best,
> > > Colin
> > >
> > > On Wed, Aug 7, 2019, at 10:37, Mickael Maison wrote:
> > > > Thanks Colin for kickstarting this initiative.
> > > >
> > > > Just one question.
> > > > - A nice feature of Zookeeper is the ability to use chroots and have several Kafka clusters use the same Zookeeper ensemble. Is this something we should keep?
> > > >
> > > > Thanks
> > > >
> > > > On Mon, Aug 5, 2019 at 7:44 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > >
> > > > > On Mon, Aug 5, 2019, at 10:02, Tom Bentley wrote:
> > > > > > Hi Colin,
> > > > > >
> > > > > > Thanks for the KIP.
> > > > > >
> > > > > > Currently ZooKeeper provides a convenient notification mechanism for knowing that broker and topic configuration has changed. While KIP-500 does suggest that incremental metadata update is expected to come to clients eventually, that would seem to imply that for some number of releases there would be no equivalent mechanism for knowing about config changes. Is there any thinking at this point about how a similar notification might be provided in the future?
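As an aside on the delta and checksum discussion quoted above: a rough sketch, in Java, of how a broker-side applier might consume the metadata log in offset order and verify an integrity checksum per record. All of the names here (MetadataDeltaApplier, DeltaRecord, and so on) are placeholders for illustration, not anything defined by KIP-500.

    import java.util.zip.CRC32;

    // Hypothetical sketch only: KIP-500 does not define these names.
    // A broker-side applier that consumes metadata deltas in offset order
    // and checks a per-record checksum before applying each one.
    public class MetadataDeltaApplier {
        // A single delta record fetched from the metadata log (placeholder shape).
        public static class DeltaRecord {
            final long offset;
            final byte[] payload;
            final long checksum;

            DeltaRecord(long offset, byte[] payload, long checksum) {
                this.offset = offset;
                this.payload = payload;
                this.checksum = checksum;
            }
        }

        private long lastAppliedOffset = -1L;

        public void apply(DeltaRecord record) {
            // The metadata is a sequence of deltas, so records must arrive in offset order.
            if (record.offset != lastAppliedOffset + 1) {
                throw new IllegalStateException("Expected offset " + (lastAppliedOffset + 1)
                    + " but got " + record.offset);
            }
            // Integrity check, analogous to the checksums we already have on regular data.
            CRC32 crc = new CRC32();
            crc.update(record.payload);
            if (crc.getValue() != record.checksum) {
                throw new IllegalStateException("Checksum mismatch at offset " + record.offset);
            }
            applyToLocalMetadataImage(record.payload);
            lastAppliedOffset = record.offset;
        }

        private void applyToLocalMetadataImage(byte[] payload) {
            // Placeholder: update the broker's in-memory metadata image here.
        }

        public long lastAppliedOffset() {
            return lastAppliedOffset;
        }
    }

The point is just that tracking the last applied offset is what lets a fetch return "a sequence of deltas since the last update" rather than a cumulative delta.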
> > > > >
> > > > > We could eventually have some inotify-like mechanism where clients could register interest in various types of events and get notified when they happened. Reading the metadata log is conceptually simple. The main complexity would be in setting up an API that made sense and that didn't unduly constrain future implementations. We'd have to think carefully about what the real use-cases for this were, though.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Mon, Aug 5, 2019 at 3:49 PM Viktor Somogyi-Vass <viktorsomo...@gmail.com> wrote:
> > > > > >
> > > > > > > Hey Colin,
> > > > > > >
> > > > > > > I think this is a long-awaited KIP, thanks for driving it. I'm excited to see this in Kafka. I collected my questions (and I accept the "TBD" answer as they might be a bit deep for this high level :) ).
> > > > > > > 1.) Are there any specific reasons for the Controller persisting its state on disk only periodically instead of asynchronously with every update? Wouldn't less frequent saves increase the chance of missing a state change if the controller crashes between two saves?
> > > > > > > 2.) Why can't we allow brokers to fetch metadata from the follower controllers? I assume that followers would have up-to-date information, therefore brokers could fetch from there in theory.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Viktor
> > > > > > >
> > > > > > > On Sun, Aug 4, 2019 at 6:58 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks for explaining, Ismael! Breaking down into follow-up KIPs sounds like a good idea.
> > > > > > > >
> > > > > > > > On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <ism...@juma.me.uk> wrote:
> > > > > > > >
> > > > > > > > > Hi Boyang,
> > > > > > > > >
> > > > > > > > > Yes, there will be several KIPs that will discuss the items you describe in detail. Colin, it may be helpful to make this clear in the KIP-500 description.
> > > > > > > > >
> > > > > > > > > Ismael
> > > > > > > > >
> > > > > > > > > On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <reluctanthero...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Colin for initiating this important effort!
> > > > > > > > > >
> > > > > > > > > > One question I have is whether we have a session discussing the controller failover in the new architecture? I know we are using the Raft protocol to fail over, yet it's still valuable to discuss the steps the new cluster is going to take to reach the stable stage again, so that we could easily measure the availability of the metadata servers.
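To make the inotify-like idea above a bit more concrete, here is a hypothetical client-side watch interface. The names are invented for illustration and are not a proposed Kafka API; the only grounded idea is "register interest in various types of events" backed by offsets in the metadata log.

    import java.util.EnumSet;
    import java.util.function.Consumer;

    // Hypothetical sketch only: not a proposed Kafka API.
    public interface MetadataWatcher {

        // The kinds of metadata changes a client might subscribe to.
        enum EventType {
            TOPIC_CONFIG_CHANGED,
            BROKER_CONFIG_CHANGED,
            PARTITION_LEADERSHIP_CHANGED
        }

        // A single change event, carrying an offset in the metadata log
        // so clients can resume from where they left off.
        final class MetadataEvent {
            public final EventType type;
            public final long metadataOffset;
            public final String resourceName;

            public MetadataEvent(EventType type, long metadataOffset, String resourceName) {
                this.type = type;
                this.metadataOffset = metadataOffset;
                this.resourceName = resourceName;
            }
        }

        // Register interest in a set of event types; the callback fires as the
        // client reads new entries from the metadata log.
        void watch(EnumSet<EventType> interests, Consumer<MetadataEvent> callback);
    }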
> > > > > > > > > >
> > > > > > > > > > Another suggestion I have is to write a step-by-step design doc like what we did in KIP-98 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging>, including the new request protocols and how they interact in the new cluster. For a complicated change like this, an implementation design doc helps a lot in the review process; otherwise most discussions we have will focus on the high level and lose important details as we discover them in the post-agreement phase.
> > > > > > > > > >
> > > > > > > > > > Boyang
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> > > > > > > > > > > > Thanks Colin for the detailed KIP. I have a few comments and questions.
> > > > > > > > > > > >
> > > > > > > > > > > > In the KIP's Motivation and Overview you mentioned the LeaderAndIsr and UpdateMetadata RPCs. For example, "updates which the controller pushes, such as LeaderAndIsr and UpdateMetadata messages". Is your thinking that we will use MetadataFetch as a replacement for just UpdateMetadata only, and add topic configuration in this state?
> > > > > > > > > > >
> > > > > > > > > > > Hi Jose,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for taking a look.
> > > > > > > > > > >
> > > > > > > > > > > The goal is for MetadataFetchRequest to replace both LeaderAndIsrRequest and UpdateMetadataRequest. Topic configurations would be fetched along with the other metadata.
> > > > > > > > > > >
> > > > > > > > > > > > In the section "Broker Metadata Management", you mention "Just like with a fetch request, the broker will track the offset of the last updates it fetched". To keep the log consistent, Raft requires that the followers keep all of the log entries (term/epoch and offset) that are after the highwatermark. Any log entry before the highwatermark can be compacted/snapshotted. Do we expect the MetadataFetch API to only return log entries up to the highwatermark, unlike the Raft replication API, which will replicate/fetch log entries after the highwatermark for consensus?
> > > > > > > > > > >
> > > > > > > > > > > Good question. Clearly, we shouldn't expose metadata updates to the brokers until they've been stored on a majority of the Raft nodes. The most obvious way to do that, like you mentioned, is to have the brokers only fetch up to the HWM, but not beyond. There might be a more clever way to do it by fetching the data, but not having the brokers act on it until the HWM advances. I'm not sure if that's worth it or not. We'll discuss this more in a separate KIP that just discusses Raft.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Broker Metadata Management", you mention "the controller will send a full metadata image rather than a series of deltas". This KIP doesn't go into the set of operations that need to be supported on top of Raft, but it would be interesting if this "full metadata image" could also be expressed as deltas. For example, assuming we are replicating a map, this "full metadata image" could be a sequence of "put" operations (znode create, to borrow ZK semantics).
> > > > > > > > > > >
> > > > > > > > > > > The full image can definitely be expressed as a sum of deltas. At some point, the number of deltas will get large enough that sending a full image is better, though. One question that we're still thinking about is how much of this can be shared with generic Kafka log code, and how much should be different.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Broker Metadata Management", you mention "This request will double as a heartbeat, letting the controller know that the broker is alive". In section "Broker State Machine", you mention "The MetadataFetch API serves as this registration mechanism". Does this mean that the MetadataFetch request will optionally include broker configuration information?
> > > > > > > > > > >
> > > > > > > > > > > I was originally thinking that the MetadataFetchRequest should include broker configuration information. Thinking about this more, maybe we should just have a special registration RPC that contains that information, to avoid sending it over the wire all the time.
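To illustrate the split described above between a one-time registration RPC and the frequent MetadataFetch heartbeat, here is a hedged sketch of what the two request shapes might carry. These are not the KIP-500 wire formats; the field layout is an assumption made purely for illustration.

    // Hypothetical sketch only: these are not KIP-500 wire formats, just an
    // illustration of splitting registration (sent once per incarnation, carrying
    // the bulky configuration) from the frequent metadata fetch / heartbeat.
    public final class RegistrationSketch {

        // Sent once when the broker starts up; the controller quorum would respond
        // with a newly assigned broker epoch for this incarnation.
        public static final class BrokerRegistrationRequest {
            public final int brokerId;
            public final java.util.Map<String, String> brokerConfig;

            public BrokerRegistrationRequest(int brokerId, java.util.Map<String, String> brokerConfig) {
                this.brokerId = brokerId;
                this.brokerConfig = brokerConfig;
            }
        }

        // Sent frequently and doubling as a heartbeat. It only needs the broker id,
        // the epoch from registration, and the last fetched offset, so the
        // configuration is not resent over the wire on every heartbeat.
        public static final class MetadataFetchRequest {
            public final int brokerId;
            public final long brokerEpoch;
            public final long lastFetchedOffset;

            public MetadataFetchRequest(int brokerId, long brokerEpoch, long lastFetchedOffset) {
                this.brokerId = brokerId;
                this.brokerEpoch = brokerEpoch;
                this.lastFetchedOffset = lastFetchedOffset;
            }
        }

        private RegistrationSketch() {}
    }

The design point being illustrated is simply that the bulky, rarely-changing configuration travels once per broker incarnation, while the heartbeat stays small.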
> > > > > > > > > > > > Does this also mean that a MetadataFetch request will result in a "write"/AppendEntries through the Raft replication protocol before you can send the associated MetadataFetch response?
> > > > > > > > > > >
> > > > > > > > > > > I think we should require the broker to be out of the Offline state before allowing it to fetch metadata, yes. So the separate registration RPC should have completed first.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Broker State", you mention that a broker can transition to online after it is caught up with the metadata. What do you mean by this? Metadata is always changing. How does the broker know that it is caught up, since it doesn't participate in the consensus or the advancement of the highwatermark?
> > > > > > > > > > >
> > > > > > > > > > > That's a good point. Being "caught up" is somewhat of a fuzzy concept here, since the brokers do not participate in the metadata consensus. I think ideally we would want to define it in terms of time ("the broker has all the updates from the last 2 minutes", for example). We should spell this out better in the KIP.
> > > > > > > > > > >
> > > > > > > > > > > > In section "Start the controller quorum nodes", you mention "Once it has taken over the /controller node, the active controller will proceed to load the full state of ZooKeeper. It will write out this information to the quorum's metadata storage. After this point, the metadata quorum will be the metadata store of record, rather than the data in ZooKeeper." During this migration, should we expect a small period of controller unavailability while the controller replicates this state to all of the Raft nodes in the controller quorum and we buffer new controller API requests?
> > > > > > > > > > >
> > > > > > > > > > > Yes, the controller would be unavailable during this time. I don't think this will be that different from the current period of unavailability when a new controller starts up and needs to load the full state from ZK. The main difference is that in this period, we'd have to write to the controller quorum rather than just to memory. But we believe this should be pretty fast.
> > > > > > > > > > >
> > > > > > > > > > > regards,
> > > > > > > > > > > Colin
> > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > -Jose
> >
> > --
> > David Arthur
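As a footnote on the time-based definition of "caught up" discussed above: a minimal sketch of what such a check could look like, assuming the broker records the append timestamp of the newest metadata record it has applied. The class and method names are hypothetical.

    import java.time.Duration;

    // Hypothetical sketch only: illustrates the time-based notion of "caught up"
    // mentioned above, where a broker counts as caught up if it has applied all
    // metadata updates from, say, the last 2 minutes.
    public class BrokerCatchUpCheck {
        private final Duration maxLag;

        public BrokerCatchUpCheck(Duration maxLag) {
            this.maxLag = maxLag;
        }

        // lastAppliedRecordTimestampMs: the append time of the newest metadata
        // record this broker has applied; nowMs: the current wall-clock time.
        // (A real implementation would also need to handle an idle metadata log.)
        public boolean isCaughtUp(long lastAppliedRecordTimestampMs, long nowMs) {
            return nowMs - lastAppliedRecordTimestampMs <= maxLag.toMillis();
        }

        public static void main(String[] args) {
            BrokerCatchUpCheck check = new BrokerCatchUpCheck(Duration.ofMinutes(2));
            long now = System.currentTimeMillis();
            // A broker that applied a record 30 seconds ago is considered caught up.
            System.out.println(check.isCaughtUp(now - 30_000L, now));   // true
            // One that is 10 minutes behind is not.
            System.out.println(check.isCaughtUp(now - 600_000L, now));  // false
        }
    }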