Hi George,

As Jason was saying, it seems like there are two directions we could go here: an external system handling batching, or the controller handling batching. I think the controller handling batching would be better, since the controller has more information about the state of the system. If the controller handled batching, it could also manage things like per-partition replication quotas, and could throttle replication down if the cluster were having problems.
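For context, the per-broker replication throttle is the knob an external tool has to manage by hand today. A rough sketch of doing that through the admin client (the broker id and rate below are made-up values):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class ReassignmentThrottle {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // Cap leader- and follower-side replication on broker 1 at 10 MB/s.
                // The per-topic *.replication.throttled.replicas configs select
                // which replicas the throttle actually applies to.
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
                Config throttles = new Config(Arrays.asList(
                    new ConfigEntry("leader.replication.throttled.rate", "10485760"),
                    new ConfigEntry("follower.replication.throttled.rate", "10485760")));
                admin.alterConfigs(Collections.singletonMap(broker, throttles)).all().get();
            }
        }
    }

If the controller owned batching, it could adjust these rates itself based on what it observes.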
We kind of need to figure out which way we're going to go on this one before we set up big new APIs, I think. If we want an external system to handle batching, then we can keep the idea that there is only one reassignment in progress at a time. If we want the controller to handle batching, we will need to get away from that idea. Instead, we would just have a set of "ideal assignments" that we tell the controller about, and let it decide how to do the batching. These ideal assignments could change continuously over time, so from the admin's point of view there would be no start/stop/cancel, just individual partition reassignments that we submit, perhaps over a long period of time. Cancellation would then mean cancelling just that individual partition reassignment, not all partition reassignments.

best,
Colin

On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> Hi Jason / Viktor,
>
> For the URP during a reassignment, if the "original_replicas" set is kept for the current pending reassignment, it will be very easy to compare it with the topic/partition's ISR. If all "original_replicas" are in the ISR, then URP should be 0 for that topic/partition.
>
> It would also be nice to have separate MaxLag/TotalLag metrics for reassignments. I think that will also require "original_replicas" (the topic/partition's replicas just before the reassignment, at the point when the AR (Assigned Replicas) is set to Set(original_replicas) + Set(new_replicas_in_reassign_partitions)).
>
> Thanks,
> George
>
> On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson <ja...@confluent.io> wrote:
>
> Hi Viktor,
>
> Thanks for writing this up. As far as questions about overlap with KIP-236 go, I agree it seems mostly orthogonal. I think KIP-236 may have had a larger initial scope, but now it focuses on cancellation, and batching is left for future work.
>
> With that said, I think we may not actually need a KIP for the current proposal, since it doesn't change any APIs. To make it more generally useful, however, it would be nice to handle batching at the partition level as well, as Jun suggests. The basic question is at what level the batching should be determined. You could rely on external processes (e.g. Cruise Control), or it could be built into the controller. There are tradeoffs either way, but I think it simplifies such tools if it is handled internally. Then it would be much safer to submit a larger reassignment even just using the simple tools that come with Kafka.
>
> By the way, since you are looking into some of the reassignment logic, another problem that we might want to address is the misleading way we report URPs during a reassignment. I had a naive proposal for this previously, but it didn't really work:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> Potentially fixing that could fall under this work as well, if you think it makes sense.
>
> Best,
> Jason
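To make the rule George describes above concrete, a minimal sketch (plain Java, not actual broker code) of counting a partition as under-replicated only when one of its original replicas has fallen out of the ISR:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class ReassignmentAwareUrp {
        // A partition in reassignment counts as under-replicated only if some
        // of its original replicas (the assignment before the reassignment
        // started) are missing from the ISR; new replicas that are still
        // catching up do not count against URP.
        static boolean isUnderReplicated(Set<Integer> originalReplicas, Set<Integer> isr) {
            return !isr.containsAll(originalReplicas);
        }

        public static void main(String[] args) {
            Set<Integer> original = new HashSet<>(Arrays.asList(1, 2, 3));
            Set<Integer> isr = new HashSet<>(Arrays.asList(1, 2, 3, 4)); // 4 is still catching up
            System.out.println(isUnderReplicated(original, isr)); // false: the move doesn't inflate URP
        }
    }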
"Update CR in Zookeeper with TR for the given partition". Which ZK path > > is this for? > > > > Jun > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass < > > viktorsomo...@gmail.com> > > wrote: > > > > > Hi Harsha, > > > > > > As far as I understand KIP-236 it's about enabling reassignment > > > cancellation and as a future plan providing a queue of replica > > reassignment > > > steps to allow manual reassignment chains. While I agree that the > > > reassignment chain has a specific use case that allows fine grain control > > > over reassignment process, My proposal on the other hand doesn't talk > > about > > > cancellation but it only provides an automatic way to incrementalize an > > > arbitrary reassignment which I think fits the general use case where > > users > > > don't want that level of control but still would like a balanced way of > > > reassignments. Therefore I think it's still relevant as an improvement of > > > the current algorithm. > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think it would be > > a > > > great improvement to Kafka. > > > > > > Cheers, > > > Viktor > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote: > > > > > > > Hi Viktor, > > > > There is already KIP-236 for the same feature and George > > made > > > > a PR for this as well. > > > > Lets consolidate these two discussions. If you have any cases that are > > > not > > > > being solved by KIP-236 can you please mention them in that thread. We > > > can > > > > address as part of KIP-236. > > > > > > > > Thanks, > > > > Harsha > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote: > > > > > Hi Folks, > > > > > > > > > > I've created a KIP about an improvement of the reassignment algorithm > > > we > > > > > have. It aims to enable partition-wise incremental reassignment. The > > > > > motivation for this is to avoid excess load that the current > > > replication > > > > > algorithm implicitly carries as in that case there are points in the > > > > > algorithm where both the new and old replica set could be online and > > > > > replicating which puts double (or almost double) pressure on the > > > brokers > > > > > which could cause problems. > > > > > Instead my proposal would slice this up into several steps where each > > > > step > > > > > is calculated based on the final target replicas and the current > > > replica > > > > > assignment taking into account scenarios where brokers could be > > offline > > > > and > > > > > when there are not enough replicas to fulfil the min.insync.replica > > > > > requirement. > > > > > > > > > > The link to the KIP: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment > > > > > > > > > > I'd be happy to receive any feedback. > > > > > > > > > > An important note is that this KIP and another one, KIP-236 that is > > > > > about > > > > > interruptible reassignment ( > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment > > > > ) > > > > > should be compatible. > > > > > > > > > > Thanks, > > > > > Viktor > > > > > > > > > > > > > > >