Hi Colin,

Thanks for the updated KIP. It makes some very good improvements to Kafka's reassignment operations.

One question: it looks like the KIP also includes cancellation of individual pending reassignments, when the AlterPartitionReassignmentsRequest has an empty replica list for the topic/partition. Will you also be implementing the partition cancellation/rollback in the PR? If yes, it will make KIP-236 (which already has a PR) trivial: to cancel all pending reassignments, one just needs to issue a ListPartitionReassignmentsRequest, then submit empty replica lists for all those topic/partitions in a single AlterPartitionReassignmentsRequest.
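To make sure we're talking about the same flow, here's a rough sketch of that cancel-all sequence against the Admin API as proposed in the KIP. The class and method names follow the current KIP draft and may change; using Optional.empty() to express "empty replicas" in the Java client is my assumption:

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class CancelAllReassignments {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Step 1: find every partition with a reassignment in progress.
            Map<TopicPartition, PartitionReassignment> ongoing =
                admin.listPartitionReassignments().reassignments().get();
            // Step 2: submit an empty (absent) target for each one, which the
            // KIP treats as cancelling that partition's pending reassignment.
            Map<TopicPartition, Optional<NewPartitionReassignment>> cancel = new HashMap<>();
            for (TopicPartition tp : ongoing.keySet()) {
                cancel.put(tp, Optional.empty());
            }
            admin.alterPartitionReassignments(cancel).all().get();
        }
    }
}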
Thanks,
George

On Friday, May 10, 2019, 8:44:31 PM PDT, Colin McCabe <cmcc...@apache.org> wrote:

On Fri, May 10, 2019, at 17:34, Colin McCabe wrote:
> On Fri, May 10, 2019, at 16:43, Jason Gustafson wrote:
> > Hi Colin,
> >
> > I think storing reassignment state at the partition level is the right move
> > and I also agree that replicas should understand that there is a
> > reassignment in progress. This makes KIP-352 a trivial follow-up, for
> > example. The only doubt I have is whether the leader and isr znode is the
> > right place to store the target reassignment. It is a bit odd to keep the
> > target assignment in a separate place from the current assignment, right? I
> > assume the thinking is probably that although the current assignment should
> > probably be in the leader and isr znode as well, it is hard to move the
> > state in a compatible way. Is that right? But if we have no plan to remove
> > the assignment znode, do you see a downside to storing the target
> > assignment there as well?
>
> Hi Jason,
>
> That's a good point -- it's probably better to keep the target
> assignment in the same znode as the current assignment, for
> consistency. I'll change the KIP.

Hi Jason,

Thanks again for the review. I took another look at this, and I think we should stick with the initial proposal of putting the reassignment state into /brokers/topics/[topic]/partitions/[partitionId]/state. The reason is that we'll want to bump the leader epoch for the partition when changing the reassignment state, and the leader epoch resides in that znode anyway. I agree there is some inconsistency here, but so be it: if we were to greenfield these ZooKeeper data structures, we might do it differently. But the proposed scheme will work fine and be extensible for the future.
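To make the placement concrete, here is a purely illustrative snippet (not part of the KIP) that reads that state znode with the plain ZooKeeper client. The topic name is made up, and exactly how the target replicas would be encoded alongside the existing fields is still up to the KIP:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ShowPartitionState {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        // Today this znode holds leader, leader_epoch, isr, controller_epoch,
        // and version. Under the proposal it would also carry the reassignment
        // target, with leader_epoch bumped on every reassignment state change.
        byte[] data = zk.getData("/brokers/topics/foo/partitions/0/state", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}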
> > A few additional questions:
> >
> > 1. Should `alterPartitionReassignments` be `alterPartitionAssignments`?
> > It's the current assignment we're altering, right?
>
> That's fair. AlterPartitionAssignments reads a little better, and I'll
> change it to that.

+1. I've changed the RPC and API name in the wiki.

> > 2. Does this change affect the Metadata API? In other words, are clients
> > aware of reassignments? If so, then we probably need a change to
> > UpdateMetadata as well. The only alternative I can think of would be to
> > represent the replica set in the Metadata request as the union of the
> > current and target replicas, but I can't think of any benefit to hiding
> > reassignments. Note that if we did this, we probably wouldn't need a
> > separate API to list reassignments.
>
> I thought about this a bit... and I think on balance, you're right. We
> should keep this information together with the replica nodes, isr
> nodes, and offline replicas, and that information is available in the
> MetadataResponse.
> However, I do think in order to do this, we'll need a flag in the
> MetadataRequest that specifies "only show me reassigning partitions".
> I'll add this.

I revisited this, and I think we should stick with the original proposal of having a separate ListPartitionReassignments API. There really is no use case where the Producer or Consumer needs to know about a reassignment. They should just be notified when the set of partitions changes, which doesn't require changes to MetadataRequest/Response. The Admin client only cares if someone is managing the reassignment. So adding this state to the MetadataResponse adds overhead for no real benefit. In the common case where there is no ongoing reassignment, it would be 4 bytes per partition of extra overhead in the MetadataResponse.

In general, I think we have a problem of oversharing in the MetadataRequest/Response. As we 10x or 100x the number of partitions we support, we'll need to get stricter about giving clients only the information they actually need, about the partitions they actually care about. Reassignment state clearly falls into the category of state that isn't needed by clients (except very specialized rebalancing programs).

Another important consideration here is that someone managing an ongoing reassignment wants the most up-to-date information, which is to be found on the controller. Therefore, adding this state to listTopics or describeTopics, which could contact any node in the cluster, is sub-optimal. Finally, adding this to listTopics or describeTopics feels like a warty API: it's an extra boolean which interacts with other extra booleans like "show internal", etc. in weird ways. I think a separate API is cleaner.
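As a sketch of how a specialized rebalancing tool might use the separate API (illustrative only; the names follow the current KIP draft, and the Set-valued overload of listPartitionReassignments is an assumption), it can poll for just the partitions it is moving and stop when none of them show up as reassigning:

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class WaitForReassignments {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            Set<TopicPartition> watched = Set.of(new TopicPartition("foo", 0));
            while (true) {
                // Ask only about the partitions we care about, rather than
                // pulling reassignment state for the whole cluster.
                Map<TopicPartition, PartitionReassignment> stillMoving =
                    admin.listPartitionReassignments(watched).reassignments().get();
                if (stillMoving.isEmpty()) {
                    break;  // all watched reassignments have completed
                }
                Thread.sleep(5_000);
            }
        }
    }
}

Since the request would be served by the controller, the answer reflects the latest reassignment state rather than a possibly stale metadata cache.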
> > 3. As replicas come into sync, they will join the ISR. Will we await all
> > target replicas joining the ISR before taking the replica out of the target
> > replicas set? Also, I assume that target replicas can still be elected as
> > leader?
>
> We'll take a replica out of the target replicas set as soon as that
> replica is in the ISR. Let me clarify this in the KIP.
>
> > 4. Probably useful to mention permissions for the new APIs.
>
> Good point. I think alterPartitionAssignments should require ALTER on
> CLUSTER. MetadataRequest permissions will be unchanged.

I added permission information.

best,
Colin

>
> best,
> Colin
>
> >
> > Thanks,
> > Jason
> >
> > On Fri, May 10, 2019 at 9:30 AM Gwen Shapira <g...@confluent.io> wrote:
> >
> > > +1 (binding)
> > > Looks great, and will be awesome to have this new capability.
> > >
> > > On Wed, May 8, 2019 at 10:23 PM Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to start the vote for KIP-455: Create an Administrative API for
> > > > Replica Reassignment. I think this KIP is important since it will unlock
> > > > many follow-on improvements to Kafka reassignment (see the "Future work"
> > > > section, plus a lot of the other discussions we've had recently about
> > > > reassignment). It also furthers the important KIP-4 goal of removing
> > > > direct access to ZK.
> > > >
> > > > I made a few changes based on the discussion in the [DISCUSS] thread. As
> > > > Robert suggested, I removed the need to explicitly cancel a reassignment
> > > > for a partition before setting up a different reassignment for that
> > > > specific partition. I also simplified the API a bit by adding a
> > > > PartitionReassignment class which is used by both the alter and list APIs.
> > > >
> > > > I modified the proposal so that we now deprecate the old znode-based API
> > > > rather than removing it completely. That should give external rebalancing
> > > > tools some time to transition to the new API.
> > > >
> > > > To clarify a question Viktor asked, I added a note that
> > > > kafka-reassign-partitions.sh will now use a --bootstrap-server argument to
> > > > contact the admin APIs.
> > > >
> > > > thanks,
> > > > Colin
> > > >
> > >
> > >
> > > --
> > > *Gwen Shapira*
> > > Product Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> > > <http://www.confluent.io/blog>
> > >