Hey everybody, We have further iterated on the KIP in the accompanying discussion thread and I'd like to propose we resume the vote.
Some notable changes: - we will store reassignment information in the `/brokers/topics/[topic]` - we will internally use two collections to represent a reassignment - "addingReplicas" and "removingReplicas". LeaderAndIsr has been updated accordingly - the Alter API will still use the "targetReplicas" collection, but the List API will now return three separate collections - the full replica set, the replicas we are adding as part of this reassignment ("addingReplicas") and the replicas we are removing ("removingReplicas") - cancellation of a reassignment now means a proper rollback of the assignment to its original state prior to the API call As always, you can re-read the KIP here https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment Best, Stanislav On Wed, May 22, 2019 at 6:12 PM Colin McCabe <cmcc...@apache.org> wrote: > Hi George, > > Thanks for taking a look. I am working on getting a PR done as a > proof-of-concept. I'll post it soon. Then we'll finish up the vote. > > best, > Colin > > On Tue, May 21, 2019, at 17:33, George Li wrote: > > Hi Colin, > > > > Great! Looking forward to these features. +1 (non-binding) > > > > What is the estimated timeline to have this implemented? If any help > > is needed in the implementation of cancelling reassignments, I can > > help if there is spare cycle. > > > > > > Thanks, > > George > > > > > > > > On Thursday, May 16, 2019, 9:48:56 AM PDT, Colin McCabe > > <cmcc...@apache.org> wrote: > > > > Hi George, > > > > Yes, KIP-455 allows the reassignment of individual partitions to be > > cancelled. I think it's very important for these operations to be at > > the partition level. > > > > best, > > Colin > > > > On Tue, May 14, 2019, at 16:34, George Li wrote: > > > Hi Colin, > > > > > > Thanks for the updated KIP. It has very good improvements of Kafka > > > reassignment operations. > > > > > > One question, looks like the KIP includes the Cancellation of > > > individual pending reassignments as well when the > > > AlterPartitionReasisgnmentRequest has empty replicas for the > > > topic/partition. Will you also be implementing the the partition > > > cancellation/rollback in the PR ? If yes, it will make KIP-236 (it > > > has PR already) trivial, since the cancel all pending reassignments, > > > one just needs to do a ListPartitionRessignmentRequest, then submit > > > empty replicas for all those topic/partitions in > > > one AlterPartitionReasisgnmentRequest. > > > > > > > > > Thanks, > > > George > > > > > > On Friday, May 10, 2019, 8:44:31 PM PDT, Colin McCabe > > > <cmcc...@apache.org> wrote: > > > > > > On Fri, May 10, 2019, at 17:34, Colin McCabe wrote: > > > > On Fri, May 10, 2019, at 16:43, Jason Gustafson wrote: > > > > > Hi Colin, > > > > > > > > > > I think storing reassignment state at the partition level is the > right move > > > > > and I also agree that replicas should understand that there is a > > > > > reassignment in progress. This makes KIP-352 a trivial follow-up > for > > > > > example. The only doubt I have is whether the leader and isr znode > is the > > > > > right place to store the target reassignment. It is a bit odd to > keep the > > > > > target assignment in a separate place from the current assignment, > right? I > > > > > assume the thinking is probably that although the current > assignment should > > > > > probably be in the leader and isr znode as well, it is hard to > move the > > > > > state in a compatible way. Is that right? But if we have no plan > to remove > > > > > the assignment znode, do you see a downside to storing the target > > > > > assignment there as well? > > > > > > > > > > > > > Hi Jason, > > > > > > > > That's a good point -- it's probably better to keep the target > > > > assignment in the same znode as the current assignment, for > > > > consistency. I'll change the KIP. > > > > > > Hi Jason, > > > > > > Thanks again for the review. > > > > > > I took another look at this, and I think we should stick with the > > > initial proposal of putting the reassignment state into > > > /brokers/topics/[topic]/partitions/[partitionId]/state. The reason is > > > because we'll want to bump the leader epoch for the partition when > > > changing the reassignment state, and the leader epoch resides in that > > > znode anyway. I agree there is some inconsistency here, but so be it: > > > if we were to greenfield these zookeeper data structures, we might do > > > it differently, but the proposed scheme will work fine and be > > > extensible for the future. > > > > > > > > > > > > A few additional questions: > > > > > > > > > > 1. Should `alterPartitionReassignments` be > `alterPartitionAssignments`? > > > > > It's the current assignment we're altering, right? > > > > > > > > That's fair. AlterPartitionAssigments reads a little better, and > I'll > > > > change it to that. > > > > > > +1. I've changed the RPC and API name in the wiki. > > > > > > > > > > > > 2. Does this change affect the Metadata API? In other words, are > clients > > > > > aware of reassignments? If so, then we probably need a change to > > > > > UpdateMetadata as well. The only alternative I can think of would > be to > > > > > represent the replica set in the Metadata request as the union of > the > > > > > current and target replicas, but I can't think of any benefit to > hiding > > > > > reassignments. Note that if we did this, we probably wouldn't need > a > > > > > separate API to list reassignments. > > > > > > > > I thought about this a bit... and I think on balance, you're right. > We > > > > should keep this information together with the replica nodes, isr > > > > nodes, and offline replicas, and that information is available in > the > > > > MetadataResponse. > > > > However, I do think in order to do this, we'll need a flag in the > > > > MetadataRequest that specifiies "only show me reassigning > partitions". > > > > I'll add this. > > > > > > I revisited this, and I think we should stick with the original > > > proposal of having a separate ListPartitionReassignments API. There > > > really is no use case where the Producer or Consumer needs to know > > > about a reassignment. They should just be notified when the set of > > > partitions changes, which doesn't require changes to > > > MetadataRequest/Response. The Admin client only cares if someone is > > > managing the reassignment. So adding this state to the > > > MetadataResponse adds overhead for no real benefit. In the common > case > > > where there is no ongoing reassignment, it would be 4 bytes per > > > partition of extra overhead in the MetadataResponse. > > > > > > In general, I think we have a problem of oversharing in the > > > MetadataRequest/Response. As we 10x or 100x the number of partitions > > > we support, we'll need to get stricter about giving clients only the > > > information they actually need, about the partitions they actually > care > > > about. Reassignment state clearly falls in the category of state that > > > isn't needed by clients (except very specialized rebalancing programs). > > > > > > Another important consideration here is that someone managing an > > > ongoing reassignment wants the most up-to-date information, which is > to > > > be found on the controller. Therefore adding this state to listTopics > > > or describeTopics, which could contact any node in the cluster, is > > > sub-optimal. > > > > > > Finally, adding this to listTopics or describeTopics feels like a > warty > > > API. It's an extra boolean which interacts with other extra booleans > > > like "show internal", etc. in weird ways. I think a separate API is > > > cleaner. > > > > > > > > > > > > 3. As replicas come into sync, they will join the ISR. Will we > await all > > > > > target replicas joining the ISR before taking the replica out of > the target > > > > > replicas set? Also, I assume that target replicas can still be > elected as > > > > > leader? > > > > > > > > We'll take a replica out of the target replicas set as soon as that > > > > replica is in the ISR. Let me clarify this in the KIP. > > > > > > > > > 4. Probably useful to mention permissions for the new APIs. > > > > > > > > Good point. I think alterPartitionAssignments should require ALTER > on > > > > CLUSTER. MetadataRequest permissions will be unchanged. > > > > > > I added permission information. > > > > > > best, > > > Colin > > > > > > > > > > > best, > > > > Colin > > > > > > > > > > > > > > Thanks, > > > > > Jason > > > > > > > > > > On Fri, May 10, 2019 at 9:30 AM Gwen Shapira <g...@confluent.io> > wrote: > > > > > > > > > > > +1 (binding) > > > > > > Looks great, and will be awesome to have this new capability. > > > > > > > > > > > > On Wed, May 8, 2019 at 10:23 PM Colin McCabe <cmcc...@apache.org> > wrote: > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > I'd like to start the vote for KIP-455: Create an > Administrative API for > > > > > > > Replica Reassignment. I think this KIP is important since it > will unlock > > > > > > > many follow-on improvements to Kafka reassignment (see the > "Future work" > > > > > > > section, plus a lot of the other discussions we've had > recently about > > > > > > > reassignment). It also furthers the important KIP-4 goal of > removing > > > > > > > direct access to ZK. > > > > > > > > > > > > > > I made a few changes based on the discussion in the [DISCUSS] > thread. As > > > > > > > Robert suggested, I removed the need to explicitly cancel a > reassignment > > > > > > > for a partition before setting up a different reassignment for > that > > > > > > > specific partition. I also simplified the API a bit by adding > a > > > > > > > PartitionReassignment class which is used by both the alter > and list > > > > > > APIs. > > > > > > > > > > > > > > I modified the proposal so that we now deprecate the old > znode-based API > > > > > > > rather than removing it completely. That should give external > > > > > > rebalancing > > > > > > > tools some time to transition to the new API. > > > > > > > > > > > > > > To clarify a question Viktor asked, I added a note that the > > > > > > > kafka-reassign-partitions.sh will now use a --bootstrap-server > argument > > > > > > to > > > > > > > contact the admin APIs. > > > > > > > > > > > > > > thanks, > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > *Gwen Shapira* > > > > > > Product Manager | Confluent > > > > > > 650.450.2760 | @gwenshap > > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog > > > > > > <http://www.confluent.io/blog> > > > > > > > > > > > > > > > > > > >