Just wanted to mention that I've started KIP-240, which builds on top of this one to provide an AdminClient API for listing and describing reassignments.
On 15 December 2017 at 14:34, Tom Bentley <t.j.bent...@gmail.com> wrote: > > Should we seek to improve this algorithm in this KIP, or leave that as > a later optimisation? > > I've updated the KIP with a proposed algorithm. > > > > > > > On 14 December 2017 at 09:57, Tom Bentley <t.j.bent...@gmail.com> wrote: > >> Thanks Ted, now fixed. >> >> On 13 December 2017 at 18:38, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> Tom: >>> bq. create a znode /admin/reassignments/$topic-$partition >>> >>> Looks like the tree structure above should be: >>> >>> /admin/reassignments/$topic/$partition >>> >>> bq. The controller removes /admin/reassignment/$topic/$partition >>> >>> Note the lack of 's' for reassignment. It would be good to make zookeeper >>> paths consistent. >>> >>> Thanks >>> >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t.j.bent...@gmail.com> >>> wrote: >>> >>> > Hi Jun and Ted, >>> > >>> > Jun, you're right that needing one watcher per reassigned partition >>> > presents a scalability problem, and using a separate notification path >>> > solves that. I also agree that it makes sense to prevent users from >>> using >>> > both methods on the same reassignment. >>> > >>> > Ted, naming the reassignments like mytopic-42 was simpler while I was >>> > proposing a watcher-per-reassignment (I'd have needed a child watcher >>> on >>> > /admin/reassignments and also on /admin/reassignments/mytopic). Using >>> the >>> > separate notification path means I don't need any watchers in the >>> > /admin/reassignments subtree, so switching to >>> /admin/reassignments/mytopic/ >>> > 42 >>> > would work, and avoid /admin/reassignments having a very large number >>> of >>> > child nodes. On the other hand it also means I have to create and >>> delete >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the >>> cost >>> > of extra round trips to zookeeper. I suppose that since reassignment is >>> > generally a slow process it makes little difference if we increase the >>> > latency of the interactions with zookeeper. >>> > >>> > I have updated the KIP with these improvements, and a more detailed >>> > description of exactly how we would manage these znodes. >>> > >>> > Reading the algorithm in KafkaController.onPartitionReassignment(), it >>> > seems that it would be suboptimal for changing reassignments in-flight. >>> > Consider an initial assignment of [1,2], reassigned to [2,3] and then >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas until >>> > broker 4 is in sync, even though 3 wasn't actually one of the original >>> > assigned replicas and is no longer a new assigned replica. I think this >>> > also affects the case where the reassignment is cancelled >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even >>> though >>> > its replica will then be deleted. >>> > >>> > Should we seek to improve this algorithm in this KIP, or leave that as >>> a >>> > later optimisation? >>> > >>> > Cheers, >>> > >>> > Tom >>> > >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote: >>> > >>> > > Another question is on the compatibility. Since now there are 2 ways >>> of >>> > > specifying a partition reassignment, one under >>> /admin/reassign_partitions >>> > > and the other under /admin/reassignments, we probably want to >>> prevent the >>> > > same topic being reassigned under both paths at the same time? >>> > > Thanks, >>> > > >>> > > Jun >>> > > >>> > > >>> > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote: >>> > > >>> > > > Hi, Tom, >>> > > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain points >>> in >>> > > > partition reassignment. Another issue that it also addresses is >>> the ZK >>> > > node >>> > > > size limit when writing the reassignment JSON. >>> > > > >>> > > > My only concern is that the KIP needs to create one watcher per >>> > > reassigned >>> > > > partition. This could add overhead in ZK and complexity for >>> debugging >>> > > when >>> > > > lots of partitions are being reassigned simultaneously. We could >>> > > > potentially improve this by introducing a separate ZK path for >>> change >>> > > > notification as we do for configs. For example, every time we >>> change >>> > the >>> > > > assignment for a set of partitions, we could further write a >>> sequential >>> > > > node /admin/reassignment_changes/[change_x]. That way, the >>> controller >>> > > > only needs to watch the change path. Once a change is triggered, >>> the >>> > > > controller can read everything under /admin/reassignments/. >>> > > > >>> > > > Jun >>> > > > >>> > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com >>> > >>> > > wrote: >>> > > > >>> > > >> Hi, >>> > > >> >>> > > >> This is still very new, but I wanted some quick feedback on a >>> > > preliminary >>> > > >> KIP which could, I think, help with providing an AdminClient API >>> for >>> > > >> partition reassignment. >>> > > >> >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236% >>> > > >> 3A+Interruptible+Partition+Reassignment >>> > > >> >>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient >>> API in >>> > > >> this >>> > > >> KIP (which would make it very big, and difficult to read), or >>> whether >>> > to >>> > > >> break it down into smaller KIPs (which makes it easier to read and >>> > > >> implement in pieces, but harder to get a high-level picture of the >>> > > >> ultimate >>> > > >> destination). For now I've gone for a very small initial KIP, but >>> I'm >>> > > >> happy >>> > > >> to sketch the bigger picture here if people are interested. >>> > > >> >>> > > >> Cheers, >>> > > >> >>> > > >> Tom >>> > > >> >>> > > > >>> > > > >>> > > >>> > >>> > >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote: >>> > >>> > > Another question is on the compatibility. Since now there are 2 ways >>> of >>> > > specifying a partition reassignment, one under >>> /admin/reassign_partitions >>> > > and the other under /admin/reassignments, we probably want to >>> prevent the >>> > > same topic being reassigned under both paths at the same time? >>> > > Thanks, >>> > > >>> > > Jun >>> > > >>> > > >>> > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote: >>> > > >>> > > > Hi, Tom, >>> > > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain points >>> in >>> > > > partition reassignment. Another issue that it also addresses is >>> the ZK >>> > > node >>> > > > size limit when writing the reassignment JSON. >>> > > > >>> > > > My only concern is that the KIP needs to create one watcher per >>> > > reassigned >>> > > > partition. This could add overhead in ZK and complexity for >>> debugging >>> > > when >>> > > > lots of partitions are being reassigned simultaneously. We could >>> > > > potentially improve this by introducing a separate ZK path for >>> change >>> > > > notification as we do for configs. For example, every time we >>> change >>> > the >>> > > > assignment for a set of partitions, we could further write a >>> sequential >>> > > > node /admin/reassignment_changes/[change_x]. That way, the >>> controller >>> > > > only needs to watch the change path. Once a change is triggered, >>> the >>> > > > controller can read everything under /admin/reassignments/. >>> > > > >>> > > > Jun >>> > > > >>> > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com >>> > >>> > > wrote: >>> > > > >>> > > >> Hi, >>> > > >> >>> > > >> This is still very new, but I wanted some quick feedback on a >>> > > preliminary >>> > > >> KIP which could, I think, help with providing an AdminClient API >>> for >>> > > >> partition reassignment. >>> > > >> >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236% >>> > > >> 3A+Interruptible+Partition+Reassignment >>> > > >> >>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient >>> API in >>> > > >> this >>> > > >> KIP (which would make it very big, and difficult to read), or >>> whether >>> > to >>> > > >> break it down into smaller KIPs (which makes it easier to read and >>> > > >> implement in pieces, but harder to get a high-level picture of the >>> > > >> ultimate >>> > > >> destination). For now I've gone for a very small initial KIP, but >>> I'm >>> > > >> happy >>> > > >> to sketch the bigger picture here if people are interested. >>> > > >> >>> > > >> Cheers, >>> > > >> >>> > > >> Tom >>> > > >> >>> > > > >>> > > > >>> > > >>> > >>> >> >> >