Just wanted to mention that I've started KIP-240, which builds on top of
this one to provide an AdminClient API for listing and describing
reassignments.

On 15 December 2017 at 14:34, Tom Bentley <t.j.bent...@gmail.com> wrote:

> > Should we seek to improve this algorithm in this KIP, or leave that as
> a later optimisation?
>
> I've updated the KIP with a proposed algorithm.
>
>
>
>
>
>
> On 14 December 2017 at 09:57, Tom Bentley <t.j.bent...@gmail.com> wrote:
>
>> Thanks Ted, now fixed.
>>
>> On 13 December 2017 at 18:38, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Tom:
>>> bq. create a znode /admin/reassignments/$topic-$partition
>>>
>>> Looks like the tree structure above should be:
>>>
>>> /admin/reassignments/$topic/$partition
>>>
>>> bq. The controller removes /admin/reassignment/$topic/$partition
>>>
>>> Note the lack of 's' for reassignment. It would be good to make zookeeper
>>> paths consistent.
>>>
>>> Thanks
>>>
>>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t.j.bent...@gmail.com>
>>> wrote:
>>>
>>> > Hi Jun and Ted,
>>> >
>>> > Jun, you're right that needing one watcher per reassigned partition
>>> > presents a scalability problem, and using a separate notification path
>>> > solves that. I also agree that it makes sense to prevent users from
>>> using
>>> > both methods on the same reassignment.
>>> >
>>> > Ted, naming the reassignments like mytopic-42 was simpler while I was
>>> > proposing a watcher-per-reassignment (I'd have needed a child watcher
>>> on
>>> > /admin/reassignments and also on /admin/reassignments/mytopic). Using
>>> the
>>> > separate notification path means I don't need any watchers in the
>>> > /admin/reassignments subtree, so switching to
>>> /admin/reassignments/mytopic/
>>> > 42
>>> > would work, and avoid /admin/reassignments having a very large number
>>> of
>>> > child nodes. On the other hand it also means I have to create and
>>> delete
>>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the
>>> cost
>>> > of extra round trips to zookeeper. I suppose that since reassignment is
>>> > generally a slow process it makes little difference if we increase the
>>> > latency of the interactions with zookeeper.
>>> >
>>> > I have updated the KIP with these improvements, and a more detailed
>>> > description of exactly how we would manage these znodes.
>>> >
>>> > Reading the algorithm in KafkaController.onPartitionReassignment(), it
>>> > seems that it would be suboptimal for changing reassignments in-flight.
>>> > Consider an initial assignment of [1,2], reassigned to [2,3] and then
>>> > changed to [2,4]. Broker 3 will remain in the assigned replicas until
>>> > broker 4 is in sync, even though 3 wasn't actually one of the original
>>> > assigned replicas and is no longer a new assigned replica. I think this
>>> > also affects the case where the reassignment is cancelled
>>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even
>>> though
>>> > its replica will then be deleted.
>>> >
>>> > Should we seek to improve this algorithm in this KIP, or leave that as
>>> a
>>> > later optimisation?
>>> >
>>> > Cheers,
>>> >
>>> > Tom
>>> >
>>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
>>> >
>>> > > Another question is on the compatibility. Since now there are 2 ways
>>> of
>>> > > specifying a partition reassignment, one under
>>> /admin/reassign_partitions
>>> > > and the other under /admin/reassignments, we probably want to
>>> prevent the
>>> > > same topic being reassigned under both paths at the same time?
>>> > > Thanks,
>>> > >
>>> > > Jun
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
>>> > >
>>> > > > Hi, Tom,
>>> > > >
>>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>>> in
>>> > > > partition reassignment. Another issue that it also addresses is
>>> the ZK
>>> > > node
>>> > > > size limit when writing the reassignment JSON.
>>> > > >
>>> > > > My only concern is that the KIP needs to create one watcher per
>>> > > reassigned
>>> > > > partition. This could add overhead in ZK and complexity for
>>> debugging
>>> > > when
>>> > > > lots of partitions are being reassigned simultaneously. We could
>>> > > > potentially improve this by introducing a separate ZK path for
>>> change
>>> > > > notification as we do for configs. For example, every time we
>>> change
>>> > the
>>> > > > assignment for a set of partitions, we could further write a
>>> sequential
>>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>>> controller
>>> > > > only needs to watch the change path. Once a change is triggered,
>>> the
>>> > > > controller can read everything under /admin/reassignments/.
>>> > > >
>>> > > > Jun
>>> > > >
>>> > > >
>>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com
>>> >
>>> > > wrote:
>>> > > >
>>> > > >> Hi,
>>> > > >>
>>> > > >> This is still very new, but I wanted some quick feedback on a
>>> > > preliminary
>>> > > >> KIP which could, I think, help with providing an AdminClient API
>>> for
>>> > > >> partition reassignment.
>>> > > >>
>>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>>> > > >> 3A+Interruptible+Partition+Reassignment
>>> > > >>
>>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>>> API in
>>> > > >> this
>>> > > >> KIP (which would make it very big, and difficult to read), or
>>> whether
>>> > to
>>> > > >> break it down into smaller KIPs (which makes it easier to read and
>>> > > >> implement in pieces, but harder to get a high-level picture of the
>>> > > >> ultimate
>>> > > >> destination). For now I've gone for a very small initial KIP, but
>>> I'm
>>> > > >> happy
>>> > > >> to sketch the bigger picture here if people are interested.
>>> > > >>
>>> > > >> Cheers,
>>> > > >>
>>> > > >> Tom
>>> > > >>
>>> > > >
>>> > > >
>>> > >
>>> >
>>> >
>>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
>>> >
>>> > > Another question is on the compatibility. Since now there are 2 ways
>>> of
>>> > > specifying a partition reassignment, one under
>>> /admin/reassign_partitions
>>> > > and the other under /admin/reassignments, we probably want to
>>> prevent the
>>> > > same topic being reassigned under both paths at the same time?
>>> > > Thanks,
>>> > >
>>> > > Jun
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io> wrote:
>>> > >
>>> > > > Hi, Tom,
>>> > > >
>>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>>> in
>>> > > > partition reassignment. Another issue that it also addresses is
>>> the ZK
>>> > > node
>>> > > > size limit when writing the reassignment JSON.
>>> > > >
>>> > > > My only concern is that the KIP needs to create one watcher per
>>> > > reassigned
>>> > > > partition. This could add overhead in ZK and complexity for
>>> debugging
>>> > > when
>>> > > > lots of partitions are being reassigned simultaneously. We could
>>> > > > potentially improve this by introducing a separate ZK path for
>>> change
>>> > > > notification as we do for configs. For example, every time we
>>> change
>>> > the
>>> > > > assignment for a set of partitions, we could further write a
>>> sequential
>>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>>> controller
>>> > > > only needs to watch the change path. Once a change is triggered,
>>> the
>>> > > > controller can read everything under /admin/reassignments/.
>>> > > >
>>> > > > Jun
>>> > > >
>>> > > >
>>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bent...@gmail.com
>>> >
>>> > > wrote:
>>> > > >
>>> > > >> Hi,
>>> > > >>
>>> > > >> This is still very new, but I wanted some quick feedback on a
>>> > > preliminary
>>> > > >> KIP which could, I think, help with providing an AdminClient API
>>> for
>>> > > >> partition reassignment.
>>> > > >>
>>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>>> > > >> 3A+Interruptible+Partition+Reassignment
>>> > > >>
>>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>>> API in
>>> > > >> this
>>> > > >> KIP (which would make it very big, and difficult to read), or
>>> whether
>>> > to
>>> > > >> break it down into smaller KIPs (which makes it easier to read and
>>> > > >> implement in pieces, but harder to get a high-level picture of the
>>> > > >> ultimate
>>> > > >> destination). For now I've gone for a very small initial KIP, but
>>> I'm
>>> > > >> happy
>>> > > >> to sketch the bigger picture here if people are interested.
>>> > > >>
>>> > > >> Cheers,
>>> > > >>
>>> > > >> Tom
>>> > > >>
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Reply via email to