Hi Colin,

> On a related note, what do you think about the idea of storing the
> reassigning replicas in
> /brokers/topics/[topic]/partitions/[partitionId]/state, rather than
> in the reassignment znode? I don't think this requires a major
> change to the proposal -- when the controller becomes aware that it
> should do a reassignment, the controller could make the changes.
> This also helps keep the reassignment znode from getting larger,
> which has been a problem.

Yeah, I think it's a good idea to store the reassignment state at a
finer level. I'm not sure the LeaderAndIsr znode is the right one
though. Another option is /brokers/topics/{topic}. That is where we
currently store the replica assignment. I think we basically want to
represent both the current state and the desired state. This would
also open the door to a cleaner way to update a reassignment while it
is still in progress.
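
To make that concrete, the topic znode could carry the desired
assignment alongside the current one. Here is a rough sketch of the
shape I have in mind; "target_partitions" is a made-up field name for
illustration, not a concrete proposal:

    {
      "version": 2,
      "partitions": {
        "0": [1, 2, 3],
        "1": [2, 3, 4]
      },
      "target_partitions": {
        "0": [1, 2, 5]
      }
    }

The controller would work toward "target_partitions" and remove each
entry once the reassignment for that partition completes, so the
znode only holds state for reassignments that are actually in flight.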

-Jason

On Mon, Apr 8, 2019 at 11:14 PM George Li
<sql_consult...@yahoo.com.invalid> wrote:

> Hi Colin / Jason,
>
> Reassignment should really be done in batches. I am not too worried
> about the reassignment znode getting larger. In a real production
> environment, too many concurrent reassignments and too-frequent
> submission of reassignments seem to cause latency spikes in the
> Kafka cluster, so batching/staggering/throttling when submitting
> reassignments is recommended.
>
> In KIP-236, the "originalReplicas" are only kept for the currently
> reassigning partitions (a small number), both in memory in the
> controller context's partitionsBeingReassigned and in the znode
> /admin/reassign_partitions. I think the "null = no replicas are
> reassigning" setting in the RPC mentioned below is a good idea.
>
> There seem to be some issues with the mail archive server of this
> mailing list. I didn't receive email after April 7th, and the
> archive for April 2019 has only 50 messages (
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)?
>
> Thanks,
> George
>
> On Mon, 08 Apr 2019 17:54:48 GMT, Colin McCabe wrote:
> > Yeah, I think adding this information to LeaderAndIsr makes
> > sense. It would be better to track "reassigningReplicas" than
> > "originalReplicas", I think. Tracking "originalReplicas" is going
> > to involve sending a lot more data, since most replicas in the
> > system are not reassigning at any given point. Or we would need a
> > hack in the RPC like null = no replicas are reassigning.
> >
> > On a related note, what do you think about the idea of storing
> > the reassigning replicas in
> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather
> > than in the reassignment znode? I don't think this requires a
> > major change to the proposal -- when the controller becomes aware
> > that it should do a reassignment, the controller could make the
> > changes. This also helps keep the reassignment znode from getting
> > larger, which has been a problem.
> >
> > best,
> > Colin
> >
> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> > > Hey George,
> > >
> > > > For the URP during a reassignment, if the "original_replicas"
> > > > is kept for the current pending reassignment, I think it will
> > > > be very easy to compare that with the topic/partition's ISR.
> > > > If all "original_replicas" are in ISR, then URP should be 0
> > > > for that topic/partition.
> > >
> > > Yeah, that makes sense. But I guess we would need
> > > "original_replicas" to be propagated to partition leaders in
> > > the LeaderAndIsr request, since leaders are the ones that are
> > > computing URPs. That is basically what KIP-352 had proposed,
> > > but we also need the changes to the reassignment path. Perhaps
> > > it makes more sense to address this problem in KIP-236 since
> > > that is where you have already introduced "original_replicas"?
> > > I'm also happy to do KIP-352 as a follow-up to KIP-236.
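> > >
> > > For what it's worth, the protocol change could be as small as
> > > one extra field per partition. A rough sketch in Scala -- these
> > > names are invented for illustration and are not the actual
> > > protocol definition:
> > >
> > >     // Hypothetical per-partition LeaderAndIsr state. An empty
> > >     // list means "no reassignment in flight", which avoids the
> > >     // null-as-sentinel hack mentioned above.
> > >     case class LeaderAndIsrPartitionState(
> > >       leader: Int,
> > >       isr: List[Int],
> > >       replicas: List[Int],
> > >       reassigningReplicas: List[Int]
> > >     )
> > >
> > > The leader could then compute URPs against the replica set
> > > minus reassigningReplicas instead of the full replica set.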
> > > Best,
> > > Jason
> > >
> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <isma...@gmail.com>
> > > wrote:
> > >
> > > > Good discussion about where we should do batching. I think if
> > > > there is a clearly great way to batch, then it makes a lot of
> > > > sense to just do it once. However, if we think there is scope
> > > > for experimenting with different approaches, then an API that
> > > > tools can use makes a lot of sense. They can experiment and
> > > > innovate. Eventually, we can integrate something into Kafka
> > > > if it makes sense.
> > > >
> > > > Ismael
> > > >
> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe
> > > > <cmcc...@apache.org> wrote:
> > > >
> > > > > Hi George,
> > > > >
> > > > > As Jason was saying, it seems like there are two directions
> > > > > we could go here: an external system handling batching, and
> > > > > the controller handling batching. I think the controller
> > > > > handling batching would be better, since the controller has
> > > > > more information about the state of the system. If the
> > > > > controller handles batching, then the controller could also
> > > > > handle things like setting up replication quotas for
> > > > > individual partitions. The controller could do things like
> > > > > throttle replication down if the cluster was having
> > > > > problems.
> > > > >
> > > > > We kind of need to figure out which way we're going to go
> > > > > on this one before we set up big new APIs, I think. If we
> > > > > want an external system to handle batching, then we can
> > > > > keep the idea that there is only one reassignment in
> > > > > progress at once. If we want the controller to handle
> > > > > batching, we will need to get away from that idea. Instead,
> > > > > we should just have a bunch of "ideal assignments" that we
> > > > > tell the controller about, and let it decide how to do the
> > > > > batching. These ideal assignments could change continuously
> > > > > over time, so from the admin's point of view there would be
> > > > > no start/stop/cancel, but just individual partition
> > > > > reassignments that we submit, perhaps over a long period of
> > > > > time. And then cancellation might just mean cancelling that
> > > > > individual partition reassignment, not all partition
> > > > > reassignments.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > > > > Hi Jason / Viktor,
> > > > > >
> > > > > > For the URP during a reassignment, if the
> > > > > > "original_replicas" is kept for the current pending
> > > > > > reassignment, I think it will be very easy to compare
> > > > > > that with the topic/partition's ISR. If all
> > > > > > "original_replicas" are in ISR, then URP should be 0 for
> > > > > > that topic/partition.
> > > > > >
> > > > > > It would also be nice to separate the metrics
> > > > > > MaxLag/TotalLag for reassignments. I think that will also
> > > > > > require "original_replicas" (the topic/partition's
> > > > > > replicas just before reassignment, when the AR (Assigned
> > > > > > Replicas) is set to Set(original_replicas) +
> > > > > > Set(new_replicas_in_reassign_partitions)).
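> > > > > >
> > > > > > To sketch the metric split (Scala, all names made up for
> > > > > > illustration; assume followerLags maps each follower id
> > > > > > to its lag and originalReplicas is the pre-reassignment
> > > > > > replica set):
> > > > > >
> > > > > >     // Followers that are only being added by the
> > > > > >     // reassignment should not feed the normal MaxLag,
> > > > > >     // only a reassignment-specific metric.
> > > > > >     val (reassignLags, normalLags) =
> > > > > >       followerLags.partition { case (replicaId, _) =>
> > > > > >         !originalReplicas.contains(replicaId)
> > > > > >       }
> > > > > >     val maxLag = (0L +: normalLags.values.toSeq).max
> > > > > >     val reassignmentMaxLag =
> > > > > >       (0L +: reassignLags.values.toSeq).max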
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > > On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> > > > > > <ja...@confluent.io> wrote:
> > > > > >
> > > > > > > Hi Viktor,
> > > > > > >
> > > > > > > Thanks for writing this up. As far as questions about
> > > > > > > overlap with KIP-236, I agree it seems mostly
> > > > > > > orthogonal. I think KIP-236 may have had a larger
> > > > > > > initial scope, but now it focuses on cancellation and
> > > > > > > batching is left for future work.
> > > > > > >
> > > > > > > With that said, I think we may not actually need a KIP
> > > > > > > for the current proposal since it doesn't change any
> > > > > > > APIs. To make it more generally useful, however, it
> > > > > > > would be nice to handle batching at the partition level
> > > > > > > as well, as Jun suggests. The basic question is at what
> > > > > > > level the batching should be determined. You could rely
> > > > > > > on external processes (e.g. Cruise Control) or it could
> > > > > > > be built into the controller. There are tradeoffs
> > > > > > > either way, but I think it simplifies such tools if it
> > > > > > > is handled internally. Then it would be much safer to
> > > > > > > submit a larger reassignment even just using the simple
> > > > > > > tools that come with Kafka.
> > > > > > >
> > > > > > > By the way, since you are looking into some of the
> > > > > > > reassignment logic, another problem that we might want
> > > > > > > to address is the misleading way we report URPs during
> > > > > > > a reassignment. I had a naive proposal for this
> > > > > > > previously, but it didn't really work:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > > > > > Potentially fixing that could fall under this work as
> > > > > > > well if you think it makes sense.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao
> > > > > > > <j...@confluent.io> wrote:
> > > > > > >
> > > > > > > > Hi, Viktor,
> > > > > > > >
> > > > > > > > Thanks for the KIP. A couple of comments below.
> > > > > > > >
> > > > > > > > 1. Another potential way to do reassignment
> > > > > > > > incrementally is to move a batch of partitions at a
> > > > > > > > time, instead of all partitions. This may lead to
> > > > > > > > less data replication since, by the time the first
> > > > > > > > batch of partitions has been completely moved, some
> > > > > > > > data of the next batch may have been deleted due to
> > > > > > > > retention and won't need to be replicated.
> > > > > > > >
> > > > > > > > 2. "Update CR in Zookeeper with TR for the given
> > > > > > > > partition". Which ZK path is this for?
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass
> > > > > > > > <viktorsomo...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Harsha,
> > > > > > > > >
> > > > > > > > > As far as I understand KIP-236, it's about enabling
> > > > > > > > > reassignment cancellation and, as a future plan,
> > > > > > > > > providing a queue of replica reassignment steps to
> > > > > > > > > allow manual reassignment chains. While I agree
> > > > > > > > > that the reassignment chain has a specific use case
> > > > > > > > > that allows fine-grained control over the
> > > > > > > > > reassignment process, my proposal doesn't talk
> > > > > > > > > about cancellation; it only provides an automatic
> > > > > > > > > way to incrementalize an arbitrary reassignment,
> > > > > > > > > which I think fits the general use case where users
> > > > > > > > > don't want that level of control but would still
> > > > > > > > > like a balanced way of doing reassignments.
> > > > > > > > > Therefore I think it's still relevant as an
> > > > > > > > > improvement of the current algorithm. Nevertheless,
> > > > > > > > > I'm happy to add my ideas to KIP-236 as I think
> > > > > > > > > they would be a great improvement to Kafka.
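> > > > > > > > >
> > > > > > > > > To show what I mean by incrementalizing, here is a
> > > > > > > > > rough Scala sketch (invented names, not the KIP's
> > > > > > > > > final algorithm) that computes one step from the
> > > > > > > > > current assignment toward the target:
> > > > > > > > >
> > > > > > > > >     // Swap in at most one new replica per step so
> > > > > > > > >     // that each partition has at most one extra
> > > > > > > > >     // replica catching up at any time.
> > > > > > > > >     def nextStep(current: Seq[Int],
> > > > > > > > >                  target: Seq[Int]): Seq[Int] = {
> > > > > > > > >       val toAdd =
> > > > > > > > >         target.filterNot(current.contains)
> > > > > > > > >       val toRemove =
> > > > > > > > >         current.filterNot(target.contains)
> > > > > > > > >       if (toAdd.isEmpty) target // only drops left
> > > > > > > > >       else {
> > > > > > > > >         val base = toRemove.headOption
> > > > > > > > >           .map(r => current.filterNot(_ == r))
> > > > > > > > >           .getOrElse(current)
> > > > > > > > >         base :+ toAdd.head
> > > > > > > > >       }
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > > Applying nextStep repeatedly until it returns the
> > > > > > > > > target yields the incremental plan.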
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Viktor
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha
> > > > > > > > > <ka...@harsha.io> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Viktor,
> > > > > > > > > > There is already KIP-236 for the same feature,
> > > > > > > > > > and George made a PR for this as well. Let's
> > > > > > > > > > consolidate these two discussions. If you have
> > > > > > > > > > any cases that are not being solved by KIP-236,
> > > > > > > > > > can you please mention them in that thread? We
> > > > > > > > > > can address them as part of KIP-236.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Harsha
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor
> > > > > > > > > > Somogyi-Vass wrote:
> > > > > > > > > > > Hi Folks,
> > > > > > > > > > >
> > > > > > > > > > > I've created a KIP about an improvement of the
> > > > > > > > > > > reassignment algorithm we have. It aims to
> > > > > > > > > > > enable partition-wise incremental
> > > > > > > > > > > reassignment. The motivation for this is to
> > > > > > > > > > > avoid the excess load that the current
> > > > > > > > > > > replication algorithm implicitly carries:
> > > > > > > > > > > there are points in the algorithm where both
> > > > > > > > > > > the new and the old replica set could be
> > > > > > > > > > > online and replicating, which puts double (or
> > > > > > > > > > > almost double) pressure on the brokers and
> > > > > > > > > > > could cause problems.
> > > > > > > > > > > Instead, my proposal would slice this up into
> > > > > > > > > > > several steps, where each step is calculated
> > > > > > > > > > > based on the final target replicas and the
> > > > > > > > > > > current replica assignment, taking into
> > > > > > > > > > > account scenarios where brokers could be
> > > > > > > > > > > offline and where there are not enough
> > > > > > > > > > > replicas to fulfil the min.insync.replicas
> > > > > > > > > > > requirement.
> > > > > > > > > > >
> > > > > > > > > > > The link to the KIP:
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > > > > > > > >
> > > > > > > > > > > I'd be happy to receive any feedback.
> > > > > > > > > > >
> > > > > > > > > > > An important note is that this KIP should be
> > > > > > > > > > > compatible with KIP-236, which is about
> > > > > > > > > > > interruptible partition reassignment:
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Viktor