> Should we seek to improve this algorithm in this KIP, or leave that as a later optimisation?
I've updated the KIP with a proposed algorithm. On 14 December 2017 at 09:57, Tom Bentley <t.j.bent...@gmail.com> wrote: > Thanks Ted, now fixed. > > On 13 December 2017 at 18:38, Ted Yu <yuzhih...@gmail.com> wrote: > >> Tom: >> bq. create a znode /admin/reassignments/$topic-$partition >> >> Looks like the tree structure above should be: >> >> /admin/reassignments/$topic/$partition >> >> bq. The controller removes /admin/reassignment/$topic/$partition >> >> Note the lack of 's' for reassignment. It would be good to make zookeeper >> paths consistent. >> >> Thanks >> >> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t.j.bent...@gmail.com> >> wrote: >> >> > Hi Jun and Ted, >> > >> > Jun, you're right that needing one watcher per reassigned partition >> > presents a scalability problem, and using a separate notification path >> > solves that. I also agree that it makes sense to prevent users from >> using >> > both methods on the same reassignment. >> > >> > Ted, naming the reassignments like mytopic-42 was simpler while I was >> > proposing a watcher-per-reassignment (I'd have needed a child watcher on >> > /admin/reassignments and also on /admin/reassignments/mytopic). Using >> the >> > separate notification path means I don't need any watchers in the >> > /admin/reassignments subtree, so switching to >> /admin/reassignments/mytopic/ >> > 42 >> > would work, and avoid /admin/reassignments having a very large number of >> > child nodes. On the other hand it also means I have to create and delete >> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the >> cost >> > of extra round trips to zookeeper. I suppose that since reassignment is >> > generally a slow process it makes little difference if we increase the >> > latency of the interactions with zookeeper. >> > >> > I have updated the KIP with these improvements, and a more detailed >> > description of exactly how we would manage these znodes. >> > >> > Reading the algorithm in KafkaController.onPartitionReassignment(), it >> > seems that it would be suboptimal for changing reassignments in-flight. >> > Consider an initial assignment of [1,2], reassigned to [2,3] and then >> > changed to [2,4]. Broker 3 will remain in the assigned replicas until >> > broker 4 is in sync, even though 3 wasn't actually one of the original >> > assigned replicas and is no longer a new assigned replica. I think this >> > also affects the case where the reassignment is cancelled >> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even >> though >> > its replica will then be deleted. >> > >> > Should we seek to improve this algorithm in this KIP, or leave that as a >> > later optimisation? >> > >> > Cheers, >> > >> > Tom >> > >> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote: >> > >> > > Another question is on the compatibility. Since now there are 2 ways >> of >> > > specifying a partition reassignment, one under >> /admin/reassign_partitions >> > > and the other under /admin/reassignments, we probably want to prevent >> the >> > > same topic being reassigned under both paths at the same time? >> > > Thanks, >> > > >> > > Jun >> > > >> > > >> > > >> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote: >> > > >> > > > Hi, Tom, >> > > > >> > > > Thanks for the KIP. It definitely addresses one of the pain points >> in >> > > > partition reassignment. Another issue that it also addresses is the >> ZK >> > > node >> > > > size limit when writing the reassignment JSON. >> > > > >> > > > My only concern is that the KIP needs to create one watcher per >> > > reassigned >> > > > partition. This could add overhead in ZK and complexity for >> debugging >> > > when >> > > > lots of partitions are being reassigned simultaneously. We could >> > > > potentially improve this by introducing a separate ZK path for >> change >> > > > notification as we do for configs. For example, every time we change >> > the >> > > > assignment for a set of partitions, we could further write a >> sequential >> > > > node /admin/reassignment_changes/[change_x]. That way, the >> controller >> > > > only needs to watch the change path. Once a change is triggered, the >> > > > controller can read everything under /admin/reassignments/. >> > > > >> > > > Jun >> > > > >> > > > >> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com> >> > > wrote: >> > > > >> > > >> Hi, >> > > >> >> > > >> This is still very new, but I wanted some quick feedback on a >> > > preliminary >> > > >> KIP which could, I think, help with providing an AdminClient API >> for >> > > >> partition reassignment. >> > > >> >> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236% >> > > >> 3A+Interruptible+Partition+Reassignment >> > > >> >> > > >> I wasn't sure whether to start fleshing out a whole AdminClient >> API in >> > > >> this >> > > >> KIP (which would make it very big, and difficult to read), or >> whether >> > to >> > > >> break it down into smaller KIPs (which makes it easier to read and >> > > >> implement in pieces, but harder to get a high-level picture of the >> > > >> ultimate >> > > >> destination). For now I've gone for a very small initial KIP, but >> I'm >> > > >> happy >> > > >> to sketch the bigger picture here if people are interested. >> > > >> >> > > >> Cheers, >> > > >> >> > > >> Tom >> > > >> >> > > > >> > > > >> > > >> > >> > >> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote: >> > >> > > Another question is on the compatibility. Since now there are 2 ways >> of >> > > specifying a partition reassignment, one under >> /admin/reassign_partitions >> > > and the other under /admin/reassignments, we probably want to prevent >> the >> > > same topic being reassigned under both paths at the same time? >> > > Thanks, >> > > >> > > Jun >> > > >> > > >> > > >> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote: >> > > >> > > > Hi, Tom, >> > > > >> > > > Thanks for the KIP. It definitely addresses one of the pain points >> in >> > > > partition reassignment. Another issue that it also addresses is the >> ZK >> > > node >> > > > size limit when writing the reassignment JSON. >> > > > >> > > > My only concern is that the KIP needs to create one watcher per >> > > reassigned >> > > > partition. This could add overhead in ZK and complexity for >> debugging >> > > when >> > > > lots of partitions are being reassigned simultaneously. We could >> > > > potentially improve this by introducing a separate ZK path for >> change >> > > > notification as we do for configs. For example, every time we change >> > the >> > > > assignment for a set of partitions, we could further write a >> sequential >> > > > node /admin/reassignment_changes/[change_x]. That way, the >> controller >> > > > only needs to watch the change path. Once a change is triggered, the >> > > > controller can read everything under /admin/reassignments/. >> > > > >> > > > Jun >> > > > >> > > > >> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com> >> > > wrote: >> > > > >> > > >> Hi, >> > > >> >> > > >> This is still very new, but I wanted some quick feedback on a >> > > preliminary >> > > >> KIP which could, I think, help with providing an AdminClient API >> for >> > > >> partition reassignment. >> > > >> >> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236% >> > > >> 3A+Interruptible+Partition+Reassignment >> > > >> >> > > >> I wasn't sure whether to start fleshing out a whole AdminClient >> API in >> > > >> this >> > > >> KIP (which would make it very big, and difficult to read), or >> whether >> > to >> > > >> break it down into smaller KIPs (which makes it easier to read and >> > > >> implement in pieces, but harder to get a high-level picture of the >> > > >> ultimate >> > > >> destination). For now I've gone for a very small initial KIP, but >> I'm >> > > >> happy >> > > >> to sketch the bigger picture here if people are interested. >> > > >> >> > > >> Cheers, >> > > >> >> > > >> Tom >> > > >> >> > > > >> > > > >> > > >> > >> > >