Hi Jun,

Thanks for replying, some answers below:


> 10. The proposal now stores the reassignment for all partitions in
> /admin/reassignment_requests/request_xxx. If the number of reassigned
> partitions is larger, the ZK write may hit the default 1MB limit and fail.
> An alternative approach is to have the reassignment requester first write
> the new assignment for each partition under
> /admin/reassignments/$topic/$partition and then write
> /admin/reassignment_requests/request_xxx with an empty value. The
> controller can then read all values under /admin/reassignments.
>

You're right that reassigning enough partitions would hit the 1MB limit,
but I don't think this would be a problem in practice because it would be
trivial to split the partitions into several requests (i.e. mutleiple
request_xxx).
I don't think the non-atomicity this would imply is a problem. By writing
the partitions whose /admin/reassignments/$topic/$partition has been
created or changed it makes it much more efficient to know which of those
znodes we need to read. If I understand your suggestion, you would have to
read every node under /admin/reassignments to figure out which had changed.


> 11. The improvement you suggested in onPartitionReassignment() sounds good
> at the high level. The computation of those dropped partitions seems a bit
> complicated. Perhaps a simple approach is to drop the replicas not in the
> original assignment and newest reassignment?
>

This was what I came up with originally too, but when I looked into
implementing it I found a couple of things which made me reconsider it.
Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we start
reassigning to [2,3], but then change our minds about 2 and switch it to 4
(maybe we've figured out a better overall balance). At that point it is
perfectly possible that broker 2 is in-sync and broker 1 is not in-sync. It
seems silly to drop broker 2 in favour of broker 1: We're needlessly giving
the cluster more work to do.

The second thing that made me reconsider was in that same scenario it's
even possible that broker 2 is the leader of the partition. Obviously we
can elect a new leader before dropping it, but not without causing
disruption to producers and consumers.

By accepting a little more complexity in choosing which brokers to drop we
make the dropping simpler (no need for leader election) and ensure the
cluster has less work to do.


> 12. You brought up the need of remembering the original assignment. This
> will be lost if the assignment is changed multiple times if we follow the
> approach described in 10. One way is to store the original assignment in
> /brokers/topics/[topic] as the following. When the final reassignment
> completes, we can remove the original field.
> {
>   "version": 1,
>   "partitions": {"0": [0, 1, 3] },
>   "originals": {"0": [0, 1, 2] }
> }
>

While I was implementing my first version of onPartitionReassignment(),
where I preferred the originals, I was storing the originals in the
/admin/reassignments/$topic/$partition znodes. Since we will remove that
znode at the end of reassignment anyway, I would suggest this is a better
place to store that data (if it's necessary to do so), so that we can avoid
another ZK round trip.


> 13. For resolving the conflict between /admin/reassign_partitions and
> /admin/reassignments/$topic/$partition, perhaps it's more natural to just
> let the assignment with a newer timestamp to override the older one?
>

That would work but with slightly different semantics to what I have: Since
/admin/reassign_partitions contains multiple partitions, using the
timestamp means the whole batch wins or losses. By tracking how each
request was made we can be more fine-grained. I'm to use the modification
time if such granularity is not required.


> 14. Implementation wise, currently, we register a watcher of the isr path
> of each partition being reassigned. This has the potential issue of
> registering many listeners. An improvement could be just piggybacking on
> the existing IsrChangeNotificationHandler, which only watches a single ZK
> path and is triggered on a batch of isr changes. This is kind of orthogonal
> to the KIP. However, if we are touching the reassignment logic, it may be
> worth considering.


Let me look into that.

Thanks,

Tom

On 16 December 2017 at 02:19, Jun Rao <j...@confluent.io> wrote:

> Hi, Tom,
>
> Thanks for the updated KIP. A few more comments below.
>
> 10. The proposal now stores the reassignment for all partitions in
> /admin/reassignment_requests/request_xxx. If the number of reassigned
> partitions is larger, the ZK write may hit the default 1MB limit and fail.
> An alternative approach is to have the reassignment requester first write
> the new assignment for each partition under
> /admin/reassignments/$topic/$partition and then write
> /admin/reassignment_requests/request_xxx with an empty value. The
> controller can then read all values under /admin/reassignments.
>
> 11. The improvement you suggested in onPartitionReassignment() sounds good
> at the high level. The computation of those dropped partitions seems a bit
> complicated. Perhaps a simple approach is to drop the replicas not in the
> original assignment and newest reassignment?
>
> 12. You brought up the need of remembering the original assignment. This
> will be lost if the assignment is changed multiple times if we follow the
> approach described in 10. One way is to store the original assignment in
> /brokers/topics/[topic] as the following. When the final reassignment
> completes, we can remove the original field.
> {
>   "version": 1,
>   "partitions": {"0": [0, 1, 3] },
>   "originals": {"0": [0, 1, 2] }
> }
>
> 13. For resolving the conflict between /admin/reassign_partitions and
> /admin/reassignments/$topic/$partition, perhaps it's more natural to just
> let the assignment with a newer timestamp to override the older one?
>
> 14. Implementation wise, currently, we register a watcher of the isr path
> of each partition being reassigned. This has the potential issue of
> registering many listeners. An improvement could be just piggybacking on
> the existing IsrChangeNotificationHandler, which only watches a single ZK
> path and is triggered on a batch of isr changes. This is kind of orthogonal
> to the KIP. However, if we are touching the reassignment logic, it may be
> worth considering.
>
> Thanks,
>
> Jun
>
> On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t.j.bent...@gmail.com>
> wrote:
>
> > Just wanted to mention that I've started KIP-240, which builds on top of
> > this one to provide an AdminClient API for listing and describing
> > reassignments.
> >
> > On 15 December 2017 at 14:34, Tom Bentley <t.j.bent...@gmail.com> wrote:
> >
> > > > Should we seek to improve this algorithm in this KIP, or leave that
> as
> > > a later optimisation?
> > >
> > > I've updated the KIP with a proposed algorithm.
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 14 December 2017 at 09:57, Tom Bentley <t.j.bent...@gmail.com>
> wrote:
> > >
> > >> Thanks Ted, now fixed.
> > >>
> > >> On 13 December 2017 at 18:38, Ted Yu <yuzhih...@gmail.com> wrote:
> > >>
> > >>> Tom:
> > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > >>>
> > >>> Looks like the tree structure above should be:
> > >>>
> > >>> /admin/reassignments/$topic/$partition
> > >>>
> > >>> bq. The controller removes /admin/reassignment/$topic/$partition
> > >>>
> > >>> Note the lack of 's' for reassignment. It would be good to make
> > zookeeper
> > >>> paths consistent.
> > >>>
> > >>> Thanks
> > >>>
> > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t.j.bent...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Hi Jun and Ted,
> > >>> >
> > >>> > Jun, you're right that needing one watcher per reassigned partition
> > >>> > presents a scalability problem, and using a separate notification
> > path
> > >>> > solves that. I also agree that it makes sense to prevent users from
> > >>> using
> > >>> > both methods on the same reassignment.
> > >>> >
> > >>> > Ted, naming the reassignments like mytopic-42 was simpler while I
> was
> > >>> > proposing a watcher-per-reassignment (I'd have needed a child
> watcher
> > >>> on
> > >>> > /admin/reassignments and also on /admin/reassignments/mytopic).
> Using
> > >>> the
> > >>> > separate notification path means I don't need any watchers in the
> > >>> > /admin/reassignments subtree, so switching to
> > >>> /admin/reassignments/mytopic/
> > >>> > 42
> > >>> > would work, and avoid /admin/reassignments having a very large
> number
> > >>> of
> > >>> > child nodes. On the other hand it also means I have to create and
> > >>> delete
> > >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs
> the
> > >>> cost
> > >>> > of extra round trips to zookeeper. I suppose that since
> reassignment
> > is
> > >>> > generally a slow process it makes little difference if we increase
> > the
> > >>> > latency of the interactions with zookeeper.
> > >>> >
> > >>> > I have updated the KIP with these improvements, and a more detailed
> > >>> > description of exactly how we would manage these znodes.
> > >>> >
> > >>> > Reading the algorithm in KafkaController.
> onPartitionReassignment(),
> > it
> > >>> > seems that it would be suboptimal for changing reassignments
> > in-flight.
> > >>> > Consider an initial assignment of [1,2], reassigned to [2,3] and
> then
> > >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas
> until
> > >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> > original
> > >>> > assigned replicas and is no longer a new assigned replica. I think
> > this
> > >>> > also affects the case where the reassignment is cancelled
> > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up,
> even
> > >>> though
> > >>> > its replica will then be deleted.
> > >>> >
> > >>> > Should we seek to improve this algorithm in this KIP, or leave that
> > as
> > >>> a
> > >>> > later optimisation?
> > >>> >
> > >>> > Cheers,
> > >>> >
> > >>> > Tom
> > >>> >
> > >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
> > >>> >
> > >>> > > Another question is on the compatibility. Since now there are 2
> > ways
> > >>> of
> > >>> > > specifying a partition reassignment, one under
> > >>> /admin/reassign_partitions
> > >>> > > and the other under /admin/reassignments, we probably want to
> > >>> prevent the
> > >>> > > same topic being reassigned under both paths at the same time?
> > >>> > > Thanks,
> > >>> > >
> > >>> > > Jun
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io>
> wrote:
> > >>> > >
> > >>> > > > Hi, Tom,
> > >>> > > >
> > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > points
> > >>> in
> > >>> > > > partition reassignment. Another issue that it also addresses is
> > >>> the ZK
> > >>> > > node
> > >>> > > > size limit when writing the reassignment JSON.
> > >>> > > >
> > >>> > > > My only concern is that the KIP needs to create one watcher per
> > >>> > > reassigned
> > >>> > > > partition. This could add overhead in ZK and complexity for
> > >>> debugging
> > >>> > > when
> > >>> > > > lots of partitions are being reassigned simultaneously. We
> could
> > >>> > > > potentially improve this by introducing a separate ZK path for
> > >>> change
> > >>> > > > notification as we do for configs. For example, every time we
> > >>> change
> > >>> > the
> > >>> > > > assignment for a set of partitions, we could further write a
> > >>> sequential
> > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > >>> controller
> > >>> > > > only needs to watch the change path. Once a change is
> triggered,
> > >>> the
> > >>> > > > controller can read everything under /admin/reassignments/.
> > >>> > > >
> > >>> > > > Jun
> > >>> > > >
> > >>> > > >
> > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > t.j.bent...@gmail.com
> > >>> >
> > >>> > > wrote:
> > >>> > > >
> > >>> > > >> Hi,
> > >>> > > >>
> > >>> > > >> This is still very new, but I wanted some quick feedback on a
> > >>> > > preliminary
> > >>> > > >> KIP which could, I think, help with providing an AdminClient
> API
> > >>> for
> > >>> > > >> partition reassignment.
> > >>> > > >>
> > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > >>> > > >>
> > >>> > > >> I wasn't sure whether to start fleshing out a whole
> AdminClient
> > >>> API in
> > >>> > > >> this
> > >>> > > >> KIP (which would make it very big, and difficult to read), or
> > >>> whether
> > >>> > to
> > >>> > > >> break it down into smaller KIPs (which makes it easier to read
> > and
> > >>> > > >> implement in pieces, but harder to get a high-level picture of
> > the
> > >>> > > >> ultimate
> > >>> > > >> destination). For now I've gone for a very small initial KIP,
> > but
> > >>> I'm
> > >>> > > >> happy
> > >>> > > >> to sketch the bigger picture here if people are interested.
> > >>> > > >>
> > >>> > > >> Cheers,
> > >>> > > >>
> > >>> > > >> Tom
> > >>> > > >>
> > >>> > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>> >
> > >>> > On 11 December 2017 at 21:31, Jun Rao <j...@confluent.io> wrote:
> > >>> >
> > >>> > > Another question is on the compatibility. Since now there are 2
> > ways
> > >>> of
> > >>> > > specifying a partition reassignment, one under
> > >>> /admin/reassign_partitions
> > >>> > > and the other under /admin/reassignments, we probably want to
> > >>> prevent the
> > >>> > > same topic being reassigned under both paths at the same time?
> > >>> > > Thanks,
> > >>> > >
> > >>> > > Jun
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <j...@confluent.io>
> wrote:
> > >>> > >
> > >>> > > > Hi, Tom,
> > >>> > > >
> > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > points
> > >>> in
> > >>> > > > partition reassignment. Another issue that it also addresses is
> > >>> the ZK
> > >>> > > node
> > >>> > > > size limit when writing the reassignment JSON.
> > >>> > > >
> > >>> > > > My only concern is that the KIP needs to create one watcher per
> > >>> > > reassigned
> > >>> > > > partition. This could add overhead in ZK and complexity for
> > >>> debugging
> > >>> > > when
> > >>> > > > lots of partitions are being reassigned simultaneously. We
> could
> > >>> > > > potentially improve this by introducing a separate ZK path for
> > >>> change
> > >>> > > > notification as we do for configs. For example, every time we
> > >>> change
> > >>> > the
> > >>> > > > assignment for a set of partitions, we could further write a
> > >>> sequential
> > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > >>> controller
> > >>> > > > only needs to watch the change path. Once a change is
> triggered,
> > >>> the
> > >>> > > > controller can read everything under /admin/reassignments/.
> > >>> > > >
> > >>> > > > Jun
> > >>> > > >
> > >>> > > >
> > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > t.j.bent...@gmail.com
> > >>> >
> > >>> > > wrote:
> > >>> > > >
> > >>> > > >> Hi,
> > >>> > > >>
> > >>> > > >> This is still very new, but I wanted some quick feedback on a
> > >>> > > preliminary
> > >>> > > >> KIP which could, I think, help with providing an AdminClient
> API
> > >>> for
> > >>> > > >> partition reassignment.
> > >>> > > >>
> > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > >>> > > >>
> > >>> > > >> I wasn't sure whether to start fleshing out a whole
> AdminClient
> > >>> API in
> > >>> > > >> this
> > >>> > > >> KIP (which would make it very big, and difficult to read), or
> > >>> whether
> > >>> > to
> > >>> > > >> break it down into smaller KIPs (which makes it easier to read
> > and
> > >>> > > >> implement in pieces, but harder to get a high-level picture of
> > the
> > >>> > > >> ultimate
> > >>> > > >> destination). For now I've gone for a very small initial KIP,
> > but
> > >>> I'm
> > >>> > > >> happy
> > >>> > > >> to sketch the bigger picture here if people are interested.
> > >>> > > >>
> > >>> > > >> Cheers,
> > >>> > > >>
> > >>> > > >> Tom
> > >>> > > >>
> > >>> > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Reply via email to