> Should we seek to improve this algorithm in this KIP, or leave that as a
later optimisation?

I've updated the KIP with a proposed algorithm.






On 14 December 2017 at 09:57, Tom Bentley <t.j.bent...@gmail.com> wrote:

> Thanks Ted, now fixed.
>
> On 13 December 2017 at 18:38, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Tom:
>> bq. create a znode /admin/reassignments/$topic-$partition
>>
>> Looks like the tree structure above should be:
>>
>> /admin/reassignments/$topic/$partition
>>
>> bq. The controller removes /admin/reassignment/$topic/$partition
>>
>> Note the lack of 's' for reassignment. It would be good to make zookeeper
>> paths consistent.
>>
>> Thanks
>>
>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t.j.bent...@gmail.com>
>> wrote:
>>
>> > Hi Jun and Ted,
>> >
>> > Jun, you're right that needing one watcher per reassigned partition
>> > presents a scalability problem, and using a separate notification path
>> > solves that. I also agree that it makes sense to prevent users from
>> using
>> > both methods on the same reassignment.
>> >
>> > Ted, naming the reassignments like mytopic-42 was simpler while I was
>> > proposing a watcher-per-reassignment (I'd have needed a child watcher on
>> > /admin/reassignments and also on /admin/reassignments/mytopic). Using
>> the
>> > separate notification path means I don't need any watchers in the
>> > /admin/reassignments subtree, so switching to
>> /admin/reassignments/mytopic/
>> > 42
>> > would work, and avoid /admin/reassignments having a very large number of
>> > child nodes. On the other hand it also means I have to create and delete
>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the
>> cost
>> > of extra round trips to zookeeper. I suppose that since reassignment is
>> > generally a slow process it makes little difference if we increase the
>> > latency of the interactions with zookeeper.
>> >
>> > I have updated the KIP with these improvements, and a more detailed
>> > description of exactly how we would manage these znodes.
>> >
>> > Reading the algorithm in KafkaController.onPartitionReassignment(), it
>> > seems that it would be suboptimal for changing reassignments in-flight.
>> > Consider an initial assignment of [1,2], reassigned to [2,3] and then
>> > changed to [2,4]. Broker 3 will remain in the assigned replicas until
>> > broker 4 is in sync, even though 3 wasn't actually one of the original
>> > assigned replicas and is no longer a new assigned replica. I think this
>> > also affects the case where the reassignment is cancelled
>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even
>> though
>> > its replica will then be deleted.
>> >
>> > Should we seek to improve this algorithm in this KIP, or leave that as a
>> > later optimisation?
>> >
>> > Cheers,
>> >
>> > Tom
>> >
>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Another question is on the compatibility. Since now there are 2 ways
>> of
>> > > specifying a partition reassignment, one under
>> /admin/reassign_partitions
>> > > and the other under /admin/reassignments, we probably want to prevent
>> the
>> > > same topic being reassigned under both paths at the same time?
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > >
>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
>> > >
>> > > > Hi, Tom,
>> > > >
>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>> in
>> > > > partition reassignment. Another issue that it also addresses is the
>> ZK
>> > > node
>> > > > size limit when writing the reassignment JSON.
>> > > >
>> > > > My only concern is that the KIP needs to create one watcher per
>> > > reassigned
>> > > > partition. This could add overhead in ZK and complexity for
>> debugging
>> > > when
>> > > > lots of partitions are being reassigned simultaneously. We could
>> > > > potentially improve this by introducing a separate ZK path for
>> change
>> > > > notification as we do for configs. For example, every time we change
>> > the
>> > > > assignment for a set of partitions, we could further write a
>> sequential
>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>> controller
>> > > > only needs to watch the change path. Once a change is triggered, the
>> > > > controller can read everything under /admin/reassignments/.
>> > > >
>> > > > Jun
>> > > >
>> > > >
>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com>
>> > > wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> This is still very new, but I wanted some quick feedback on a
>> > > preliminary
>> > > >> KIP which could, I think, help with providing an AdminClient API
>> for
>> > > >> partition reassignment.
>> > > >>
>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>> > > >> 3A+Interruptible+Partition+Reassignment
>> > > >>
>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>> API in
>> > > >> this
>> > > >> KIP (which would make it very big, and difficult to read), or
>> whether
>> > to
>> > > >> break it down into smaller KIPs (which makes it easier to read and
>> > > >> implement in pieces, but harder to get a high-level picture of the
>> > > >> ultimate
>> > > >> destination). For now I've gone for a very small initial KIP, but
>> I'm
>> > > >> happy
>> > > >> to sketch the bigger picture here if people are interested.
>> > > >>
>> > > >> Cheers,
>> > > >>
>> > > >> Tom
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>> >
>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Another question is on the compatibility. Since now there are 2 ways
>> of
>> > > specifying a partition reassignment, one under
>> /admin/reassign_partitions
>> > > and the other under /admin/reassignments, we probably want to prevent
>> the
>> > > same topic being reassigned under both paths at the same time?
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > >
>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
>> > >
>> > > > Hi, Tom,
>> > > >
>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>> in
>> > > > partition reassignment. Another issue that it also addresses is the
>> ZK
>> > > node
>> > > > size limit when writing the reassignment JSON.
>> > > >
>> > > > My only concern is that the KIP needs to create one watcher per
>> > > reassigned
>> > > > partition. This could add overhead in ZK and complexity for
>> debugging
>> > > when
>> > > > lots of partitions are being reassigned simultaneously. We could
>> > > > potentially improve this by introducing a separate ZK path for
>> change
>> > > > notification as we do for configs. For example, every time we change
>> > the
>> > > > assignment for a set of partitions, we could further write a
>> sequential
>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>> controller
>> > > > only needs to watch the change path. Once a change is triggered, the
>> > > > controller can read everything under /admin/reassignments/.
>> > > >
>> > > > Jun
>> > > >
>> > > >
>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com>
>> > > wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> This is still very new, but I wanted some quick feedback on a
>> > > preliminary
>> > > >> KIP which could, I think, help with providing an AdminClient API
>> for
>> > > >> partition reassignment.
>> > > >>
>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>> > > >> 3A+Interruptible+Partition+Reassignment
>> > > >>
>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>> API in
>> > > >> this
>> > > >> KIP (which would make it very big, and difficult to read), or
>> whether
>> > to
>> > > >> break it down into smaller KIPs (which makes it easier to read and
>> > > >> implement in pieces, but harder to get a high-level picture of the
>> > > >> ultimate
>> > > >> destination). For now I've gone for a very small initial KIP, but
>> I'm
>> > > >> happy
>> > > >> to sketch the bigger picture here if people are interested.
>> > > >>
>> > > >> Cheers,
>> > > >>
>> > > >> Tom
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Reply via email to