Hey Jun, I think I can answer some of your questions on behalf of Colin -- he can confirm if I'm correct
> 10. The KIP adds two new fields (AddingReplicas and RemovingReplicas) to LeaderAndIsr request. Could you explain how these 2 fields will be used? Sorry for not explaining this in the KIP - those fields won't be used by the non-controller brokers yet. Our plans for them are outlined in the Future Work section of the KIP - namely "Reassignment Quotas that only throttle reassignment traffic" and "Add reassignment metrics". > Should we include those two fields in UpdateMetadata and potentially Metadata requests too? I recall this was discussed in the beginning by Colin and Jason, so I'll let Colin answer that question. 11 & 12. Correct, we need to send StopReplica requests. The implementation does this ( https://github.com/apache/kafka/pull/7128/files#diff-ed90e8ecc5439a5ede5e362255d11be1R651) -- I'll update the KIP to mention it as well. I tried to document the algorithm here https://github.com/apache/kafka/pull/7128/files#diff-ed90e8ecc5439a5ede5e362255d11be1R521 . 13. I think so. ( https://github.com/apache/kafka/pull/7128#discussion_r308866206). I'll reflect this in the KIP Here is the updated KIP diff - https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=112820260&selectedPageVersions=36&selectedPageVersions=35 Thanks, Stanislav On Tue, Jul 30, 2019 at 10:18 PM Jun Rao <j...@confluent.io> wrote: > Hi, Colin, > > Thanks for the KIP. Sorry for the late reply. LGTM overall. A few detailed > comments below. > > 10. The KIP adds two new fields (AddingReplicas and RemovingReplicas) to > LeaderAndIsr request. Could you explain how these 2 fields will be used? > Should we include those two fields in UpdateMetadata and potentially > Metadata requests too? > > 11. "If a new reassignment is issued during an on-going one, we cancel the > current one by emptying out both AR and RR, constructing them from (the > updated from the last-reassignment) R and TR, and starting anew." In this > case, it seems that the controller needs to issue a StopReplica request to > remove those unneeded replicas. > > 12. "Essentially, once a cancellation is called we subtract AR from R, > empty out both AR and RR, and send LeaderAndIsr requests to cancel the > replica movements that have not yet completed." Similar to the above, it > seems the controller needs to issue a StopReplica request to remove those > unneeded replicas. > > 13. Since we changed the format of the topics/[topic] zNode, should we bump > up the version number in the json value? > > Jun > > On Mon, Jul 22, 2019 at 8:38 AM Colin McCabe <cmcc...@apache.org> wrote: > > > Hi all, > > > > With three non-binding +1 votes from Viktor Somogyi-Vass, Robert Barrett, > > and George Li, and 3 binding +1 votes from Gwen Shapira, Jason Gustafson, > > and myself, the vote passes. Thanks, everyone! > > > > best, > > Colin > > > > On Fri, Jul 19, 2019, at 08:55, Robert Barrett wrote: > > > +1 (non-binding). Thanks for the KIP! > > > > > > On Thu, Jul 18, 2019 at 5:59 PM George Li <sql_consult...@yahoo.com > > .invalid> > > > wrote: > > > > > > > +1 (non-binding) > > > > > > > > > > > > > > > > Thanks for addressing the comments. > > > > George > > > > > > > > On Thursday, July 18, 2019, 05:03:58 PM PDT, Gwen Shapira < > > > > g...@confluent.io> wrote: > > > > > > > > Renewing my +1, thank you Colin and Stan for working through all the > > > > questions, edge cases, requests and alternatives. We ended up with a > > > > great protocol. > > > > > > > > On Thu, Jul 18, 2019 at 4:54 PM Jason Gustafson <ja...@confluent.io> > > > > wrote: > > > > > > > > > > +1 Thanks for the KIP. Really looking forward to this! > > > > > > > > > > -Jason > > > > > > > > > > On Wed, Jul 17, 2019 at 1:41 PM Colin McCabe <cmcc...@apache.org> > > wrote: > > > > > > > > > > > Thanks, Stanislav. Let's restart the vote to reflect the fact > that > > > > we've > > > > > > made significant changes. The new vote will go for 3 days as > > usual. > > > > > > > > > > > > I'll start with my +1 (binding). > > > > > > > > > > > > best, > > > > > > Colin > > > > > > > > > > > > > > > > > > On Wed, Jul 17, 2019, at 08:56, Stanislav Kozlovski wrote: > > > > > > > Hey everybody, > > > > > > > > > > > > > > We have further iterated on the KIP in the accompanying > > discussion > > > > thread > > > > > > > and I'd like to propose we resume the vote. > > > > > > > > > > > > > > Some notable changes: > > > > > > > - we will store reassignment information in the > > > > `/brokers/topics/[topic]` > > > > > > > - we will internally use two collections to represent a > > reassignment > > > > - > > > > > > > "addingReplicas" and "removingReplicas". LeaderAndIsr has been > > > > updated > > > > > > > accordingly > > > > > > > - the Alter API will still use the "targetReplicas" collection, > > but > > > > the > > > > > > > List API will now return three separate collections - the full > > > > replica > > > > > > set, > > > > > > > the replicas we are adding as part of this reassignment > > > > > > ("addingReplicas") > > > > > > > and the replicas we are removing ("removingReplicas") > > > > > > > - cancellation of a reassignment now means a proper rollback of > > the > > > > > > > assignment to its original state prior to the API call > > > > > > > > > > > > > > As always, you can re-read the KIP here > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment > > > > > > > > > > > > > > Best, > > > > > > > Stanislav > > > > > > > > > > > > > > On Wed, May 22, 2019 at 6:12 PM Colin McCabe < > cmcc...@apache.org > > > > > > > wrote: > > > > > > > > > > > > > > > Hi George, > > > > > > > > > > > > > > > > Thanks for taking a look. I am working on getting a PR done > > as a > > > > > > > > proof-of-concept. I'll post it soon. Then we'll finish up > the > > > > vote. > > > > > > > > > > > > > > > > best, > > > > > > > > Colin > > > > > > > > > > > > > > > > On Tue, May 21, 2019, at 17:33, George Li wrote: > > > > > > > > > Hi Colin, > > > > > > > > > > > > > > > > > > Great! Looking forward to these features. +1 > > (non-binding) > > > > > > > > > > > > > > > > > > What is the estimated timeline to have this implemented? > If > > any > > > > help > > > > > > > > > is needed in the implementation of cancelling > > reassignments, I > > > > can > > > > > > > > > help if there is spare cycle. > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > George > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thursday, May 16, 2019, 9:48:56 AM PDT, Colin McCabe > > > > > > > > > <cmcc...@apache.org> wrote: > > > > > > > > > > > > > > > > > > Hi George, > > > > > > > > > > > > > > > > > > Yes, KIP-455 allows the reassignment of individual > > partitions to > > > > be > > > > > > > > > cancelled. I think it's very important for these > operations > > to > > > > be at > > > > > > > > > the partition level. > > > > > > > > > > > > > > > > > > best, > > > > > > > > > Colin > > > > > > > > > > > > > > > > > > On Tue, May 14, 2019, at 16:34, George Li wrote: > > > > > > > > > > Hi Colin, > > > > > > > > > > > > > > > > > > > > Thanks for the updated KIP. It has very good > improvements > > of > > > > Kafka > > > > > > > > > > reassignment operations. > > > > > > > > > > > > > > > > > > > > One question, looks like the KIP includes the > Cancellation > > of > > > > > > > > > > individual pending reassignments as well when the > > > > > > > > > > AlterPartitionReasisgnmentRequest has empty replicas for > > the > > > > > > > > > > topic/partition. Will you also be implementing the the > > > > partition > > > > > > > > > > cancellation/rollback in the PR ? If yes, it will > make > > > > KIP-236 > > > > > > (it > > > > > > > > > > has PR already) trivial, since the cancel all pending > > > > > > reassignments, > > > > > > > > > > one just needs to do a ListPartitionRessignmentRequest, > > then > > > > submit > > > > > > > > > > empty replicas for all those topic/partitions in > > > > > > > > > > one AlterPartitionReasisgnmentRequest. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > George > > > > > > > > > > > > > > > > > > > > On Friday, May 10, 2019, 8:44:31 PM PDT, Colin McCabe > > > > > > > > > > <cmcc...@apache.org> wrote: > > > > > > > > > > > > > > > > > > > > On Fri, May 10, 2019, at 17:34, Colin McCabe wrote: > > > > > > > > > > > On Fri, May 10, 2019, at 16:43, Jason Gustafson wrote: > > > > > > > > > > > > Hi Colin, > > > > > > > > > > > > > > > > > > > > > > > > I think storing reassignment state at the partition > > level > > > > is > > > > > > the > > > > > > > > right move > > > > > > > > > > > > and I also agree that replicas should understand that > > > > there is > > > > > > a > > > > > > > > > > > > reassignment in progress. This makes KIP-352 a > trivial > > > > > > follow-up > > > > > > > > for > > > > > > > > > > > > example. The only doubt I have is whether the leader > > and > > > > isr > > > > > > znode > > > > > > > > is the > > > > > > > > > > > > right place to store the target reassignment. It is a > > bit > > > > odd > > > > > > to > > > > > > > > keep the > > > > > > > > > > > > target assignment in a separate place from the > current > > > > > > assignment, > > > > > > > > right? I > > > > > > > > > > > > assume the thinking is probably that although the > > current > > > > > > > > assignment should > > > > > > > > > > > > probably be in the leader and isr znode as well, it > is > > > > hard to > > > > > > > > move the > > > > > > > > > > > > state in a compatible way. Is that right? But if we > > have no > > > > > > plan > > > > > > > > to remove > > > > > > > > > > > > the assignment znode, do you see a downside to > storing > > the > > > > > > target > > > > > > > > > > > > assignment there as well? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > > > > > > > > > > > That's a good point -- it's probably better to keep the > > > > target > > > > > > > > > > > assignment in the same znode as the current assignment, > > for > > > > > > > > > > > consistency. I'll change the KIP. > > > > > > > > > > > > > > > > > > > > Hi Jason, > > > > > > > > > > > > > > > > > > > > Thanks again for the review. > > > > > > > > > > > > > > > > > > > > I took another look at this, and I think we should stick > > with > > > > the > > > > > > > > > > initial proposal of putting the reassignment state into > > > > > > > > > > /brokers/topics/[topic]/partitions/[partitionId]/state. > > The > > > > > > reason is > > > > > > > > > > because we'll want to bump the leader epoch for the > > partition > > > > when > > > > > > > > > > changing the reassignment state, and the leader epoch > > resides > > > > in > > > > > > that > > > > > > > > > > znode anyway. I agree there is some inconsistency here, > > but > > > > so be > > > > > > it: > > > > > > > > > > if we were to greenfield these zookeeper data structures, > > we > > > > might > > > > > > do > > > > > > > > > > it differently, but the proposed scheme will work fine > and > > be > > > > > > > > > > extensible for the future. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > A few additional questions: > > > > > > > > > > > > > > > > > > > > > > > > 1. Should `alterPartitionReassignments` be > > > > > > > > `alterPartitionAssignments`? > > > > > > > > > > > > It's the current assignment we're altering, right? > > > > > > > > > > > > > > > > > > > > > > That's fair. AlterPartitionAssigments reads a little > > > > better, and > > > > > > > > I'll > > > > > > > > > > > change it to that. > > > > > > > > > > > > > > > > > > > > +1. I've changed the RPC and API name in the wiki. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. Does this change affect the Metadata API? In other > > > > words, > > > > > > are > > > > > > > > clients > > > > > > > > > > > > aware of reassignments? If so, then we probably need > a > > > > change > > > > > > to > > > > > > > > > > > > UpdateMetadata as well. The only alternative I can > > think of > > > > > > would > > > > > > > > be to > > > > > > > > > > > > represent the replica set in the Metadata request as > > the > > > > union > > > > > > of > > > > > > > > the > > > > > > > > > > > > current and target replicas, but I can't think of any > > > > benefit > > > > > > to > > > > > > > > hiding > > > > > > > > > > > > reassignments. Note that if we did this, we probably > > > > wouldn't > > > > > > need > > > > > > > > a > > > > > > > > > > > > separate API to list reassignments. > > > > > > > > > > > > > > > > > > > > > > I thought about this a bit... and I think on balance, > > you're > > > > > > right. > > > > > > > > We > > > > > > > > > > > should keep this information together with the replica > > > > nodes, isr > > > > > > > > > > > nodes, and offline replicas, and that information is > > > > available in > > > > > > > > the > > > > > > > > > > > MetadataResponse. > > > > > > > > > > > However, I do think in order to do this, we'll need a > > flag > > > > in > > > > > > the > > > > > > > > > > > MetadataRequest that specifiies "only show me > reassigning > > > > > > > > partitions". > > > > > > > > > > > I'll add this. > > > > > > > > > > > > > > > > > > > > I revisited this, and I think we should stick with the > > original > > > > > > > > > > proposal of having a separate ListPartitionReassignments > > API. > > > > > > There > > > > > > > > > > really is no use case where the Producer or Consumer > needs > > to > > > > know > > > > > > > > > > about a reassignment. They should just be notified when > > the > > > > set of > > > > > > > > > > partitions changes, which doesn't require changes to > > > > > > > > > > MetadataRequest/Response. The Admin client only cares if > > > > someone > > > > > > is > > > > > > > > > > managing the reassignment. So adding this state to the > > > > > > > > > > MetadataResponse adds overhead for no real benefit. In > the > > > > common > > > > > > > > case > > > > > > > > > > where there is no ongoing reassignment, it would be 4 > > bytes per > > > > > > > > > > partition of extra overhead in the MetadataResponse. > > > > > > > > > > > > > > > > > > > > In general, I think we have a problem of oversharing in > the > > > > > > > > > > MetadataRequest/Response. As we 10x or 100x the number > of > > > > > > partitions > > > > > > > > > > we support, we'll need to get stricter about giving > clients > > > > only > > > > > > the > > > > > > > > > > information they actually need, about the partitions they > > > > actually > > > > > > > > care > > > > > > > > > > about. Reassignment state clearly falls in the category > of > > > > state > > > > > > that > > > > > > > > > > isn't needed by clients (except very specialized > > rebalancing > > > > > > programs). > > > > > > > > > > > > > > > > > > > > Another important consideration here is that someone > > managing > > > > an > > > > > > > > > > ongoing reassignment wants the most up-to-date > information, > > > > which > > > > > > is > > > > > > > > to > > > > > > > > > > be found on the controller. Therefore adding this state > to > > > > > > listTopics > > > > > > > > > > or describeTopics, which could contact any node in the > > > > cluster, is > > > > > > > > > > sub-optimal. > > > > > > > > > > > > > > > > > > > > Finally, adding this to listTopics or describeTopics > feels > > > > like a > > > > > > > > warty > > > > > > > > > > API. It's an extra boolean which interacts with other > > extra > > > > > > booleans > > > > > > > > > > like "show internal", etc. in weird ways. I think a > > separate > > > > API > > > > > > is > > > > > > > > > > cleaner. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. As replicas come into sync, they will join the > ISR. > > > > Will we > > > > > > > > await all > > > > > > > > > > > > target replicas joining the ISR before taking the > > replica > > > > out > > > > > > of > > > > > > > > the target > > > > > > > > > > > > replicas set? Also, I assume that target replicas can > > > > still be > > > > > > > > elected as > > > > > > > > > > > > leader? > > > > > > > > > > > > > > > > > > > > > > We'll take a replica out of the target replicas set as > > soon > > > > as > > > > > > that > > > > > > > > > > > replica is in the ISR. Let me clarify this in the KIP. > > > > > > > > > > > > > > > > > > > > > > > 4. Probably useful to mention permissions for the new > > APIs. > > > > > > > > > > > > > > > > > > > > > > Good point. I think alterPartitionAssignments should > > require > > > > > > ALTER > > > > > > > > on > > > > > > > > > > > CLUSTER. MetadataRequest permissions will be > unchanged. > > > > > > > > > > > > > > > > > > > > I added permission information. > > > > > > > > > > > > > > > > > > > > best, > > > > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > best, > > > > > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 10, 2019 at 9:30 AM Gwen Shapira < > > > > > > g...@confluent.io> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > +1 (binding) > > > > > > > > > > > > > Looks great, and will be awesome to have this new > > > > capability. > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, May 8, 2019 at 10:23 PM Colin McCabe < > > > > > > cmcc...@apache.org> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to start the vote for KIP-455: Create an > > > > > > > > Administrative API for > > > > > > > > > > > > > > Replica Reassignment. I think this KIP is > > important > > > > since > > > > > > it > > > > > > > > will unlock > > > > > > > > > > > > > > many follow-on improvements to Kafka reassignment > > (see > > > > the > > > > > > > > "Future work" > > > > > > > > > > > > > > section, plus a lot of the other discussions > we've > > had > > > > > > > > recently about > > > > > > > > > > > > > > reassignment). It also furthers the important > > KIP-4 > > > > goal > > > > > > of > > > > > > > > removing > > > > > > > > > > > > > > direct access to ZK. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made a few changes based on the discussion in > the > > > > > > [DISCUSS] > > > > > > > > thread. As > > > > > > > > > > > > > > Robert suggested, I removed the need to > explicitly > > > > cancel a > > > > > > > > reassignment > > > > > > > > > > > > > > for a partition before setting up a different > > > > reassignment > > > > > > for > > > > > > > > that > > > > > > > > > > > > > > specific partition. I also simplified the API a > > bit by > > > > > > adding > > > > > > > > a > > > > > > > > > > > > > > PartitionReassignment class which is used by both > > the > > > > alter > > > > > > > > and list > > > > > > > > > > > > > APIs. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I modified the proposal so that we now deprecate > > the > > > > old > > > > > > > > znode-based API > > > > > > > > > > > > > > rather than removing it completely. That should > > give > > > > > > external > > > > > > > > > > > > > rebalancing > > > > > > > > > > > > > > tools some time to transition to the new API. > > > > > > > > > > > > > > > > > > > > > > > > > > > > To clarify a question Viktor asked, I added a > note > > > > that the > > > > > > > > > > > > > > kafka-reassign-partitions.sh will now use a > > > > > > --bootstrap-server > > > > > > > > argument > > > > > > > > > > > > > to > > > > > > > > > > > > > > contact the admin APIs. > > > > > > > > > > > > > > > > > > > > > > > > > > > > thanks, > > > > > > > > > > > > > > Colin > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > *Gwen Shapira* > > > > > > > > > > > > > Product Manager | Confluent > > > > > > > > > > > > > 650.450.2760 | @gwenshap > > > > > > > > > > > > > Follow us: Twitter < > https://twitter.com/ConfluentInc> > > | > > > > blog > > > > > > > > > > > > > <http://www.confluent.io/blog> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Gwen Shapira > > > > Product Manager | Confluent > > > > 650.450.2760 | @gwenshap > > > > Follow us: Twitter | blog > > > > > > > > > >