Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Colin McCabe Tue, 25 Jun 2019 12:02:56 -0700

Hi Viktor,

Now that the 2.3 release is over, we're going to be turning our attention back 
to working on KIP-455, which provides an API for partition reassignment, and 
also solves the incremental reassignment problem.  Sorry about the pause, but I 
had to focus on the stuff that was going into 2.3.


I think last time we talked about this, the consensus was that KIP-455 
supersedes KIP-435, since KIP-455 supports incremental reassignment.  We also 
don't want to add more technical debt in the form of a new ZooKeeper-based API 
that we'll have to support for a while.  So let's focus on KIP-455 here.  We 
have more resources now so I think we'll be able to get it done soonish.

best,
Colin


On Tue, Jun 25, 2019, at 08:09, Viktor Somogyi-Vass wrote:
> Hi All,
> 
> I have added another improvement to this, which is to limit the parallel
> leader movements. I think I'll soon (maybe late this week or early next)
> start a vote on this too if there is no additional feedback.
> 
> Thanks,
> Viktor
> 
> On Mon, Apr 29, 2019 at 1:26 PM Viktor Somogyi-Vass <viktorsomo...@gmail.com>
> wrote:
> 
> > Hi Folks,
> >
> > I've updated the KIP with batching, which would work on both the replica
> > and the partition level. To explain it briefly: if, for instance, the
> > replica level is set to 2 and the partition level is set to 3, then 2x3=6
> > replica reassignments would be in progress at the same time. In the case of
> > reassigning a single partition from (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9), we
> > would form the batches (0, 1) → (5, 6); (2, 3) → (7, 8) and 4 → 9 and would
> > execute the reassignment in this order.
> >
> > Let me know what you think.
> >
> > Best,
> > Viktor
> >
> > On Mon, Apr 15, 2019 at 7:01 PM Viktor Somogyi-Vass <
> > viktorsomo...@gmail.com> wrote:
> >
> >> A follow up on the batching topic to clarify my points above.
> >>
> >> Generally I think that batching should be a core feature since, as Colin
> >> said, the controller should possess all the related information.
> >> Also Cruise Control (or really any 3rd party admin system) might build
> >> upon this to give a more holistic approach to balancing brokers. We may
> >> provide them with APIs that act like building blocks to make their life
> >> easier, such as incrementalization, batching, cancellation and rollback,
> >> but I think the more advanced we go, the more advanced a control surface
> >> we'll need, and Kafka's basic tooling might not be suitable for that.
> >>
> >> Best,
> >> Viktor
> >>
> >>
> >> On Mon, 15 Apr 2019, 18:22 Viktor Somogyi-Vass, <viktorsomo...@gmail.com>
> >> wrote:
> >>
> >>> Hey Guys,
> >>>
> >>> I'll reply to you all in this email:
> >>>
> >>> @Jun:
> >>> 1. Yes, it'd be a good idea to add this feature; I'll write this into
> >>> the KIP. I was actually thinking about introducing two dynamic configs
> >>> called reassignment.parallel.partition.count and
> >>> reassignment.parallel.replica.count. The first property would control how
> >>> many partition reassignments we can do concurrently. The second would go
> >>> one level finer in granularity and would control how many replicas we want
> >>> to move for a given partition. One more thing that'd be useful to fix is
> >>> that a given list of partition -> replica list entries would be executed
> >>> in the same order (from first to last), so it's predictable overall and
> >>> the user would have some control over the order of reassignments, since
> >>> the JSON is still assembled by the user.
> >>> 2. The /kafka/brokers/topics/{topic} znode, to be specific. I'll update
> >>> the KIP to contain this.
> >>>
> >>> @Jason:
> >>> I think building this functionality into Kafka would definitely benefit
> >>> all the users, and CC as well, as it'd simplify their software as you
> >>> said. As I understand it, the main advantage of CC and other similar
> >>> software is to give high-level features for automatic load balancing.
> >>> Reliability, stability and predictability of the reassignment should be a
> >>> core feature of Kafka. I think the incrementalization feature would make
> >>> it more stable. I would consider cancellation a core feature too, and we
> >>> can leave the gate open for external tools to feed in their reassignment
> >>> JSON as they want. I was also thinking about what set of features we can
> >>> provide for Kafka, but I think the more advanced we go, the more need
> >>> there is for an administrative UI component.
> >>> Regarding KIP-352: Thanks for pointing this out, I hadn't seen it,
> >>> although lately I was also thinking about the throttling aspect of it.
> >>> It would be a nice add-on to Kafka: though the above configs provide some
> >>> level of control, it'd be nice to put an upper cap on the bandwidth and
> >>> make it monitorable.
> >>>
> >>> Viktor
> >>>
> >>> On Wed, Apr 10, 2019 at 2:57 AM Jason Gustafson <ja...@confluent.io>
> >>> wrote:
> >>>
> >>>> Hi Colin,
> >>>>
> >>>> > On a related note, what do you think about the idea of storing the
> >>>> > reassigning replicas in
> >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> >>>> > the reassignment znode?  I don't think this requires a major change to the
> >>>> > proposal-- when the controller becomes aware that it should do a
> >>>> > reassignment, the controller could make the changes.  This also helps keep
> >>>> > the reassignment znode from getting larger, which has been a problem.
> >>>>
> >>>>
> >>>> Yeah, I think it's a good idea to store the reassignment state at a finer
> >>>> level. I'm not sure the LeaderAndIsr znode is the right one though. Another
> >>>> option is /brokers/topics/{topic}. That is where we currently store the
> >>>> replica assignment. I think we basically want to represent both the current
> >>>> state and the desired state. This would also open the door to a cleaner way
> >>>> to update a reassignment while it is still in progress.
> >>>>
> >>>> -Jason
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Apr 8, 2019 at 11:14 PM George Li <sql_consult...@yahoo.com.invalid>
> >>>> wrote:
> >>>>
> >>>> >  Hi Colin / Jason,
> >>>> >
> >>>> > Reassignment should really be done in batches.  I am not too worried about
> >>>> > the reassignment znode getting larger.  In a real production environment,
> >>>> > too many concurrent reassignments and too frequent submission of
> >>>> > reassignments seemed to cause latency spikes in the Kafka cluster.  So
> >>>> > batching/staggering/throttling of submitted reassignments is recommended.
> >>>> >
> >>>> > In KIP-236, the "originalReplicas" are only kept for the currently
> >>>> > reassigning partitions (small #), and kept in memory in the controller
> >>>> > context's partitionsBeingReassigned as well as in the znode
> >>>> > /admin/reassign_partitions.  I think the suggestion below, "setting in the
> >>>> > RPC like null = no replicas are reassigning", is a good idea.
> >>>> >
> >>>> > There seem to be some issues with the mail archive server of this mailing
> >>>> > list?  I didn't receive email after April 7th, and the archive for April
> >>>> > 2019 has only 50 messages (
> >>>> > http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)?
> >>>> >
> >>>> > Thanks,
> >>>> > George
> >>>> >
> >>>> >    Mon, 08 Apr 2019 17:54:48 GMT  Colin McCabe wrote:
> >>>> >
> >>>> >   Yeah, I think adding this information to LeaderAndIsr makes sense.  It
> >>>> > would be better to track "reassigningReplicas" than "originalReplicas", I
> >>>> > think.  Tracking "originalReplicas" is going to involve sending a lot more
> >>>> > data, since most replicas in the system are not reassigning at any given
> >>>> > point.  Or we would need a hack in the RPC like null = no replicas are
> >>>> > reassigning.
> >>>> >
> >>>> > On a related note, what do you think about the idea of storing the
> >>>> > reassigning replicas in
> >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> >>>> > the reassignment znode?  I don't think this requires a major change to the
> >>>> > proposal-- when the controller becomes aware that it should do a
> >>>> > reassignment, the controller could make the changes.  This also helps keep
> >>>> > the reassignment znode from getting larger, which has been a problem.
> >>>> >
> >>>> > best,
> >>>> > Colin
> >>>> >
> >>>> >
> >>>> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> >>>> > > Hey George,
> >>>> > >
> >>>> > > > For the URP during a reassignment, if the "original_replicas" are kept
> >>>> > > > for the current pending reassignment, I think it will be very easy to
> >>>> > > > compare that with the topic/partition's ISR.  If all "original_replicas"
> >>>> > > > are in ISR, then URP should be 0 for that topic/partition.
> >>>> > >
> >>>> > >
> >>>> > > Yeah, that makes sense. But I guess we would need "original_replicas" to be
> >>>> > > propagated to partition leaders in the LeaderAndIsr request since leaders
> >>>> > > are the ones that are computing URPs. That is basically what KIP-352 had
> >>>> > > proposed, but we also need the changes to the reassignment path. Perhaps it
> >>>> > > makes more sense to address this problem in KIP-236 since that is where you
> >>>> > > have already introduced "original_replicas"? I'm also happy to do KIP-352
> >>>> > > as a follow-up to KIP-236.
> >>>> > >
> >>>> > > Best,
> >>>> > > Jason
> >>>> > >
> >>>> > >
> >>>> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <isma...@gmail.com> wrote:
> >>>> > >
> >>>> > > > Good discussion about where we should do batching. I think if there is a
> >>>> > > > clear great way to batch, then it makes a lot of sense to just do it once.
> >>>> > > > However, if we think there is scope for experimenting with different
> >>>> > > > approaches, then an API that tools can use makes a lot of sense. They can
> >>>> > > > experiment and innovate. Eventually, we can integrate something into Kafka
> >>>> > > > if it makes sense.
> >>>> > > >
> >>>> > > > Ismael
> >>>> > > >
> >>>> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cmcc...@apache.org> wrote:
> >>>> > > >
> >>>> > > > > Hi George,
> >>>> > > > >
> >>>> > > > > As Jason was saying, it seems like there are two directions we could go
> >>>> > > > > here: an external system handling batching, and the controller handling
> >>>> > > > > batching.  I think the controller handling batching would be better,
> >>>> > > > > since the controller has more information about the state of the system.
> >>>> > > > > If the controller handles batching, then the controller could also handle
> >>>> > > > > things like setting up replication quotas for individual partitions.  The
> >>>> > > > > controller could do things like throttle replication down if the cluster
> >>>> > > > > was having problems.
> >>>> > > > >
> >>>> > > > > We kind of need to figure out which way we're going to go on this one
> >>>> > > > > before we set up big new APIs, I think.  If we want an external system to
> >>>> > > > > handle batching, then we can keep the idea that there is only one
> >>>> > > > > reassignment in progress at once.  If we want the controller to handle
> >>>> > > > > batching, we will need to get away from that idea.  Instead, we should
> >>>> > > > > just have a bunch of "ideal assignments" that we tell the controller
> >>>> > > > > about, and let it decide how to do the batching.  These ideal assignments
> >>>> > > > > could change continuously over time, so from the admin's point of view,
> >>>> > > > > there would be no start/stop/cancel, but just individual partition
> >>>> > > > > reassignments that we submit, perhaps over a long period of time.  And
> >>>> > > > > then cancellation might just mean cancelling that individual partition
> >>>> > > > > reassignment, not all partition reassignments.
> >>>> > > > >
> >>>> > > > > best,
> >>>> > > > > Colin
> >>>> > > > >
> >>>> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> >>>> > > > > >  Hi Jason / Viktor,
> >>>> > > > > >
> >>>> > > > > > For the URP during a reassignment, if the "original_replicas" are
> >>>> > > > > > kept for the current pending reassignment, I think it will be very
> >>>> > > > > > easy to compare that with the topic/partition's ISR.  If all
> >>>> > > > > > "original_replicas" are in ISR, then URP should be 0 for that
> >>>> > > > > > topic/partition.
> >>>> > > > > >
> >>>> > > > > > It would also be nice to separate the metrics MaxLag/TotalLag for
> >>>> > > > > > reassignments. I think that will also require "original_replicas"
> >>>> > > > > > (the topic/partition's replicas just before reassignment, when the AR
> >>>> > > > > > (Assigned Replicas) is set to Set(original_replicas) +
> >>>> > > > > > Set(new_replicas_in_reassign_partitions) ).
> >>>> > > > > >
> >>>> > > > > > Thanks,
> >>>> > > > > > George
> >>>> > > > > >
> >>>> > > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> >>>> > > > > > <ja...@confluent.io> wrote:
> >>>> > > > > >
> >>>> > > > > >  Hi Viktor,
> >>>> > > > > >
> >>>> > > > > > Thanks for writing this up. As far as questions about overlap with
> >>>> > > > > > KIP-236, I agree it seems mostly orthogonal. I think KIP-236 may have
> >>>> > > > > > had a larger initial scope, but now it focuses on cancellation and
> >>>> > > > > > batching is left for future work.
> >>>> > > > > >
> >>>> > > > > > With that said, I think we may not actually need a KIP for the current
> >>>> > > > > > proposal since it doesn't change any APIs. To make it more generally
> >>>> > > > > > useful, however, it would be nice to handle batching at the partition
> >>>> > > > > > level as well, as Jun suggests. The basic question is at what level the
> >>>> > > > > > batching should be determined. You could rely on external processes
> >>>> > > > > > (e.g. cruise control) or it could be built into the controller. There
> >>>> > > > > > are tradeoffs either way, but I think it simplifies such tools if it is
> >>>> > > > > > handled internally. Then it would be much safer to submit a larger
> >>>> > > > > > reassignment even just using the simple tools that come with Kafka.
> >>>> > > > > >
> >>>> > > > > > By the way, since you are looking into some of the reassignment logic,
> >>>> > > > > > another problem that we might want to address is the misleading way we
> >>>> > > > > > report URPs during a reassignment. I had a naive proposal for this
> >>>> > > > > > previously, but it didn't really work:
> >>>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> >>>> > > > > > Potentially fixing that could fall under this work as well if you think
> >>>> > > > > > it makes sense.
> >>>> > > > > >
> >>>> > > > > > Best,
> >>>> > > > > > Jason
> >>>> > > > > >
> >>>> > > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <j...@confluent.io> wrote:
> >>>> > > > > >
> >>>> > > > > > > Hi, Viktor,
> >>>> > > > > > >
> >>>> > > > > > > Thanks for the KIP. A couple of comments below.
> >>>> > > > > > >
> >>>> > > > > > > 1. Another potential way to do reassignment incrementally is to move
> >>>> > > > > > > a batch of partitions at a time, instead of all partitions. This may
> >>>> > > > > > > lead to less data replication since by the time the first batch of
> >>>> > > > > > > partitions has been completely moved, some data of the next batch may
> >>>> > > > > > > have been deleted due to retention and doesn't need to be replicated.
> >>>> > > > > > >
> >>>> > > > > > > 2. "Update CR in Zookeeper with TR for the given partition". Which ZK
> >>>> > > > > > > path is this for?
> >>>> > > > > > >
> >>>> > > > > > > Jun
> >>>> > > > > > >
> >>>> > > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> >>>> > > > > > > viktorsomo...@gmail.com> wrote:
> >>>> > > > > > >
> >>>> > > > > > > > Hi Harsha,
> >>>> > > > > > > >
> >>>> > > > > > > > As far as I understand KIP-236, it's about enabling reassignment
> >>>> > > > > > > > cancellation and, as a future plan, providing a queue of replica
> >>>> > > > > > > > reassignment steps to allow manual reassignment chains. While I
> >>>> > > > > > > > agree that the reassignment chain has a specific use case that
> >>>> > > > > > > > allows fine-grained control over the reassignment process, my
> >>>> > > > > > > > proposal on the other hand doesn't talk about cancellation; it only
> >>>> > > > > > > > provides an automatic way to incrementalize an arbitrary
> >>>> > > > > > > > reassignment, which I think fits the general use case where users
> >>>> > > > > > > > don't want that level of control but still would like a balanced
> >>>> > > > > > > > way of doing reassignments. Therefore I think it's still relevant
> >>>> > > > > > > > as an improvement of the current algorithm.
> >>>> > > > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think it
> >>>> > > > > > > > would be a great improvement to Kafka.
> >>>> > > > > > > >
> >>>> > > > > > > > Cheers,
> >>>> > > > > > > > Viktor
> >>>> > > > > > > >
> >>>> > > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote:
> >>>> > > > > > > >
> >>>> > > > > > > > > Hi Viktor,
> >>>> > > > > > > > >            There is already KIP-236 for the same feature, and
> >>>> > > > > > > > > George made a PR for this as well.
> >>>> > > > > > > > > Let's consolidate these two discussions. If you have any cases
> >>>> > > > > > > > > that are not being solved by KIP-236, can you please mention
> >>>> > > > > > > > > them in that thread? We can address them as part of KIP-236.
> >>>> > > > > > > > >
> >>>> > > > > > > > > Thanks,
> >>>> > > > > > > > > Harsha
> >>>> > > > > > > > >
> >>>> > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> >>>> > > > > > > > > > Hi Folks,
> >>>> > > > > > > > > >
> >>>> > > > > > > > > > I've created a KIP about an improvement of the reassignment
> >>>> > > > > > > > > > algorithm we have. It aims to enable partition-wise
> >>>> > > > > > > > > > incremental reassignment. The motivation for this is to avoid
> >>>> > > > > > > > > > the excess load that the current replication algorithm
> >>>> > > > > > > > > > implicitly carries: there are points in the algorithm where
> >>>> > > > > > > > > > both the new and old replica sets could be online and
> >>>> > > > > > > > > > replicating, which puts double (or almost double) pressure on
> >>>> > > > > > > > > > the brokers and could cause problems.
> >>>> > > > > > > > > > Instead, my proposal would slice this up into several steps
> >>>> > > > > > > > > > where each step is calculated based on the final target
> >>>> > > > > > > > > > replicas and the current replica assignment, taking into
> >>>> > > > > > > > > > account scenarios where brokers could be offline and when
> >>>> > > > > > > > > > there are not enough replicas to fulfil the
> >>>> > > > > > > > > > min.insync.replicas requirement.
> >>>> > > > > > > > > >
> >>>> > > > > > > > > > The link to the KIP:
> >>>> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> >>>> > > > > > > > > >
> >>>> > > > > > > > > > I'd be happy to receive any feedback.
> >>>> > > > > > > > > >
> >>>> > > > > > > > > > An important note is that this KIP and another one, KIP-236,
> >>>> > > > > > > > > > which is about interruptible reassignment (
> >>>> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> >>>> > > > > > > > > > ), should be compatible.
> >>>> > > > > > > > > >
> >>>> > > > > > > > > > Thanks,
> >>>> > > > > > > > > > Viktor
> >>>> > > > > > > > > >
>
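
Note on the batching example quoted above (replica level 2, partition level 3, one partition moving from (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9)): a minimal Python sketch of how those batches could be formed. The function names are hypothetical, and pairing old and new replicas by position is an assumption made only to reproduce the example; the KIP's actual algorithm also has to account for offline brokers and min.insync.replicas, which this does not attempt.

```python
from typing import List, Tuple

Batch = Tuple[List[int], List[int]]


def replica_batches(current: List[int], target: List[int],
                    replica_level: int) -> List[Batch]:
    """Pair old and new replicas by position and cut them into batches.

    Assumption: both lists have the same length and are paired positionally.
    """
    if replica_level < 1:
        raise ValueError("replica_level must be at least 1")
    if len(current) != len(target):
        raise ValueError("this sketch assumes equal-sized replica lists")
    return [
        (current[i:i + replica_level], target[i:i + replica_level])
        for i in range(0, len(current), replica_level)
    ]


def max_in_flight(replica_level: int, partition_level: int) -> int:
    """Upper bound on replica moves in progress at the same time."""
    return replica_level * partition_level


# The example from the thread: three batches, executed in this order.
batches = replica_batches([0, 1, 2, 3, 4], [5, 6, 7, 8, 9], replica_level=2)
for old, new in batches:
    print(old, "->", new)
print("max in flight:", max_in_flight(replica_level=2, partition_level=3))
```

The last batch is smaller than the replica level because five replicas do not divide evenly into pairs, which matches the trailing 4 → 9 step in the example.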

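Note on the URP discussion quoted above: a small Python sketch contrasting a check against the full assigned-replica set with the check George describes, where a partition counts as under-replicated only if one of its "original_replicas" is missing from the ISR. All function names are hypothetical, and the "naive" check is a simplification for illustration, not a description of Kafka's actual metric code.

```python
from typing import List


def assigned_replicas(original: List[int], new: List[int]) -> List[int]:
    """AR during a reassignment: original replicas plus new ones, no duplicates."""
    return original + [r for r in new if r not in original]


def urp_naive(ar: List[int], isr: List[int]) -> bool:
    """Under-replicated if any assigned replica is missing from the ISR."""
    return not set(ar) <= set(isr)


def urp_reassignment_aware(original: List[int], isr: List[int]) -> bool:
    """Under-replicated only if an original replica is missing from the ISR."""
    return not set(original) <= set(isr)


original = [1, 2, 3]
new = [3, 4, 5]
ar = assigned_replicas(original, new)

# New replicas 4 and 5 are still catching up; all original replicas are in sync.
isr = [1, 2, 3]
print("naive:", urp_naive(ar, isr))
print("reassignment-aware:", urp_reassignment_aware(original, isr))
```

With the ISR equal to the original replicas, the naive check flags the partition while the reassignment-aware one does not, which is the misleading report the thread wants to avoid. As Jason points out, the leader would need "original_replicas" propagated to it to make this comparison.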