Hi Jun and Ted,

Jun, you're right that needing one watcher per reassigned partition
presents a scalability problem, and using a separate notification path
solves that. I also agree that it makes sense to prevent users from using
both methods on the same reassignment.

Ted, naming the reassignments like mytopic-42 was simpler while I was
proposing a watcher-per-reassignment (I'd have needed a child watcher on
/admin/reassignments and also on /admin/reassignments/mytopic). Using the
separate notification path means I don't need any watchers in the
/admin/reassignments subtree, so switching to /admin/reassignments/mytopic/42
would work, and avoid /admin/reassignments having a very large number of
child nodes. On the other hand it also means I have to create and delete
the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the cost
of extra round trips to zookeeper. I suppose that since reassignment is
generally a slow process it makes little difference if we increase the
latency of the interactions with zookeeper.

I have updated the KIP with these improvements, and a more detailed
description of exactly how we would manage these znodes.

Reading the algorithm in KafkaController.onPartitionReassignment(), it
seems that it would be suboptimal for changing reassignments in-flight.
Consider an initial assignment of [1,2], reassigned to [2,3] and then
changed to [2,4]. Broker 3 will remain in the assigned replicas until
broker 4 is in sync, even though 3 wasn't actually one of the original
assigned replicas and is no longer a new assigned replica. I think this
also affects the case where the reassignment is cancelled
([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even though
its replica will then be deleted.

Should we seek to improve this algorithm in this KIP, or leave that as a
later optimisation?

Cheers,

Tom

On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:

> Another question is on the compatibility. Since now there are 2 ways of
> specifying a partition reassignment, one under /admin/reassign_partitions
> and the other under /admin/reassignments, we probably want to prevent the
> same topic being reassigned under both paths at the same time?
> Thanks,
>
> Jun
>
>
>
> On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the KIP. It definitely addresses one of the pain points in
> > partition reassignment. Another issue that it also addresses is the ZK
> node
> > size limit when writing the reassignment JSON.
> >
> > My only concern is that the KIP needs to create one watcher per
> reassigned
> > partition. This could add overhead in ZK and complexity for debugging
> when
> > lots of partitions are being reassigned simultaneously. We could
> > potentially improve this by introducing a separate ZK path for change
> > notification as we do for configs. For example, every time we change the
> > assignment for a set of partitions, we could further write a sequential
> > node /admin/reassignment_changes/[change_x]. That way, the controller
> > only needs to watch the change path. Once a change is triggered, the
> > controller can read everything under /admin/reassignments/.
> >
> > Jun
> >
> >
> > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> This is still very new, but I wanted some quick feedback on a
> preliminary
> >> KIP which could, I think, help with providing an AdminClient API for
> >> partition reassignment.
> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >> 3A+Interruptible+Partition+Reassignment
> >>
> >> I wasn't sure whether to start fleshing out a whole AdminClient API in
> >> this
> >> KIP (which would make it very big, and difficult to read), or whether to
> >> break it down into smaller KIPs (which makes it easier to read and
> >> implement in pieces, but harder to get a high-level picture of the
> >> ultimate
> >> destination). For now I've gone for a very small initial KIP, but I'm
> >> happy
> >> to sketch the bigger picture here if people are interested.
> >>
> >> Cheers,
> >>
> >> Tom
> >>
> >
> >
>


On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:

> Another question is on the compatibility. Since now there are 2 ways of
> specifying a partition reassignment, one under /admin/reassign_partitions
> and the other under /admin/reassignments, we probably want to prevent the
> same topic being reassigned under both paths at the same time?
> Thanks,
>
> Jun
>
>
>
> On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the KIP. It definitely addresses one of the pain points in
> > partition reassignment. Another issue that it also addresses is the ZK
> node
> > size limit when writing the reassignment JSON.
> >
> > My only concern is that the KIP needs to create one watcher per
> reassigned
> > partition. This could add overhead in ZK and complexity for debugging
> when
> > lots of partitions are being reassigned simultaneously. We could
> > potentially improve this by introducing a separate ZK path for change
> > notification as we do for configs. For example, every time we change the
> > assignment for a set of partitions, we could further write a sequential
> > node /admin/reassignment_changes/[change_x]. That way, the controller
> > only needs to watch the change path. Once a change is triggered, the
> > controller can read everything under /admin/reassignments/.
> >
> > Jun
> >
> >
> > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> This is still very new, but I wanted some quick feedback on a
> preliminary
> >> KIP which could, I think, help with providing an AdminClient API for
> >> partition reassignment.
> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >> 3A+Interruptible+Partition+Reassignment
> >>
> >> I wasn't sure whether to start fleshing out a whole AdminClient API in
> >> this
> >> KIP (which would make it very big, and difficult to read), or whether to
> >> break it down into smaller KIPs (which makes it easier to read and
> >> implement in pieces, but harder to get a high-level picture of the
> >> ultimate
> >> destination). For now I've gone for a very small initial KIP, but I'm
> >> happy
> >> to sketch the bigger picture here if people are interested.
> >>
> >> Cheers,
> >>
> >> Tom
> >>
> >
> >
>

Reply via email to