Hey there Viktor,

Thanks for working on this KIP! I agree that the reliability, stability and predictability of a reassignment should be a core feature of Kafka.
Let me first explicitly confirm my understanding of the configs and the algorithm:

* reassignment.parallel.replica.count - the maximum number of replicas that we can move at once, *per partition*
* reassignment.parallel.partition.count - the maximum number of partitions we can move at once
* reassignment.parallel.leader.movements - the maximum number of leader movements we can have at once

As far as I currently understand it, your proposed algorithm will naturally prioritize leader movement first. E.g. if reassignment.parallel.replica.count=1 and reassignment.parallel.partition.count == reassignment.parallel.leader.movements, we will always move one replica at a time - the first possible one in the replica set (which will be the leader if it is part of the excess replica set (ER)). Am I correct in saying that?

Regarding the KIP, I've got a couple of comments/questions:

1. Does it make sense to add `max` somewhere in the configs' names?

2. How does this KIP play along with KIP-455's notion of multiple rebalances - do the configs apply to a single AlterPartitionAssignmentsRequest or are they global?

3. Unless I've missed it, the algorithm does not take into account `reassignment.parallel.leader.movements`.

4. The KIP says that the order of the input has some control over how the batches are created - i.e. it's deterministic. What would the batches of the following reassignment look like?

reassignment.parallel.replica.count=1
reassignment.parallel.partition.count=MAX_INT
reassignment.parallel.leader.movements=1

partitionA - (0,1,2) -> (3,4,5)
partitionB - (0,1,2) -> (3,4,5)
partitionC - (0,1,2) -> (3,4,5)

From my understanding, we would start with A(0->3), B(1->4) and C(1->4). Is that correct? Would the second step then continue with B(0->3)? (I've added a small sketch of my reading at the end of this mail to make it concrete.)
If the configurations are global, I can imagine we will have a bit more trouble preserving the expected ordering, especially across controller failovers - but I'll avoid speculating until you confirm the scope of the configs.

5. Regarding the new behavior of electing the new preferred leader in the "first step" of the reassignment - does this obey the `auto.leader.rebalance.enable` config? If not, I have concerns about how backwards compatible this might be - e.g. imagine a user does a huge reassignment (as they have always done) and suddenly a huge leader shift happens, whereas the user expected to manually shift preferred leaders at a slower rate via the kafka-preferred-replica-election.sh tool.

6. What is the expected behavior if we dynamically change one of the configs to a lower value while a reassignment is happening? Would we cancel some of the currently reassigned partitions or would we account for the new values on the next reassignment? I assume the latter, but it's good to be explicit.

Some small nits:
- Could we have a section in the KIP where we explicitly define what each config does? This can be inferred from the KIP as is, but it requires careful reading, whereas some developers might want to skim through the change to get a quick sense. It also improves readability, but that's my personal opinion.
- Could you better clarify how a reassignment step is different from the currently existing algorithm? Maybe laying both algorithms out in the KIP would be clearest.
- The names of the OngoingPartitionReassignment and CurrentPartitionReassignment fields in the ListPartitionReassignmentsResponse are a bit confusing to me. Unfortunately, I don't have a better suggestion, but maybe somebody else in the community has one.
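To make question 4 a bit more concrete, here is a small, self-contained Scala sketch of how I currently read the batch selection. All of the names in it (BatchSketch, selectFirstBatch, Movement) are made up by me and are not part of the KIP - it only illustrates my understanding, so please correct me if the actual algorithm differs:

object BatchSketch {

  // A movement of a single replica of a partition from one broker to another.
  final case class Movement(partition: String, fromBroker: Int, toBroker: Int, isLeader: Boolean)

  // Picks the movements of the first "batch" under my reading of the three configs:
  // at most maxReplicasPerPartition replica movements per partition, at most
  // maxParallelPartitions partitions at once and at most maxLeaderMovements leader
  // movements in flight, with the leader (position 0) considered first.
  def selectFirstBatch(
      reassignments: Seq[(String, Seq[Int], Seq[Int])], // (partition, current replicas, target replicas)
      maxReplicasPerPartition: Int,                      // reassignment.parallel.replica.count
      maxParallelPartitions: Int,                        // reassignment.parallel.partition.count
      maxLeaderMovements: Int                            // reassignment.parallel.leader.movements
  ): Seq[Movement] = {
    var leaderBudget = maxLeaderMovements
    reassignments.take(maxParallelPartitions).flatMap { case (partition, current, target) =>
      // Positions whose replica actually changes, in replica set order; position 0 is the leader.
      val candidates = current.zip(target).zipWithIndex.collect {
        case ((from, to), idx) if from != to => Movement(partition, from, to, isLeader = idx == 0)
      }
      // Skip the leader movement if the global leader movement budget is already used up.
      val picked = candidates
        .filter(m => !m.isLeader || leaderBudget > 0)
        .take(maxReplicasPerPartition)
      leaderBudget -= picked.count(_.isLeader)
      picked
    }
  }

  def main(args: Array[String]): Unit = {
    val reassignments = Seq(
      ("partitionA", Seq(0, 1, 2), Seq(3, 4, 5)),
      ("partitionB", Seq(0, 1, 2), Seq(3, 4, 5)),
      ("partitionC", Seq(0, 1, 2), Seq(3, 4, 5))
    )
    // Prints the leader movement 0->3 for partitionA and the follower movement 1->4
    // for partitionB and partitionC, i.e. A(0->3), B(1->4), C(1->4).
    selectFirstBatch(reassignments, maxReplicasPerPartition = 1,
      maxParallelPartitions = Int.MaxValue, maxLeaderMovements = 1).foreach(println)
  }
}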
Thanks,
Stanislav

On Thu, Jun 27, 2019 at 3:24 PM Viktor Somogyi-Vass <viktorsomo...@gmail.com> wrote:

> Hi All,
>
> I've renamed my KIP as its name was a bit confusing so we'll continue it in
> this thread.
> The previous thread for record:
>
> https://lists.apache.org/thread.html/0e97e30271f80540d4da1947bba94832639767e511a87bb2ba530fe7@%3Cdev.kafka.apache.org%3E
>
> A short summary of the KIP:
> In case of a vast partition reassignment (thousands of partitions at once)
> Kafka can collapse under the increased replication traffic. This KIP will
> mitigate it by introducing internal batching done by the controller.
> Besides putting a bandwidth limit on the replication it is useful to batch
> partition movements as fewer number of partitions will use the available
> bandwidth for reassignment and they finish faster.
> The main control handles are:
> - the number of parallel leader movements,
> - the number of parallel partition movements
> - and the number of parallel replica movements.
>
> Thank you for the feedback and ideas so far in the previous thread and I'm
> happy to receive more.
>
> Regards,
> Viktor
>

--
Best,
Stanislav