ArneSchulze opened a new pull request, #18903: URL: https://github.com/apache/kafka/pull/18903
This PR introduces an alternative algorithm to the one used in the `kafka-reassign-partitions` script to minimize the number of replica-movement needed to balance out brokers. It tries to improve on the following properties in order: 1) replica balance across brokers for the cluster 2) replica balance across racks for a topic partition 3) leader balance across brokers for the cluster The algorithm guarantees that the proposed assignments will never worsen properties one and two. Property three is considered for ranking changes, but not enforced. It will modify the initial assignments in two stages, executed in the following order: 1) Assign unassigned replicas 2) Move already assigned replicas In the second stage replicas are moved iteratively between brokers until there is no possible move for any replica. A move is considered possible if at least one of the properties is improved while none of the others is worsened. In each iteration the best possible move is chosen and executed. Moves for replicas that have already been moved in a previous iteration are preferred to other moves. Moving already moved replicas will not increase the overall number of moves required (`1 -> 2` + `2 -> 3` reduces to `1 -> 3`). Moves are ranked by improvement scores. Each move has one improvement score for each of the properties defined above. Improvement scores are compared in the order of priority defined above. An improvement score is computed by subtracting the score of the current state from the score of the hypothetical new state after the move was executed. The score of a state is computed by a slightly modified sum of residual squares computation to avoid floating point arithmetic. For example: The score for `1) replica balance across brokers for the cluster` is $\sum_{b \in brokers} replicaCount(b)^2$. The resulting score will get smaller the smaller the variance of the values. During the first stage, the algorithm will assign all unassigned replicas. An unassigned replica is a replica assigned to a broker not contained in the broker metadata. Possible initial assignments for each replica will be ranked similarly to the ranking while moving replicas. Instead of using the improvement score, the initial assignment solely uses the score of the new state for each property. Therefore, moves that are considered impossible in the move stage can be executed in the initial assignment stage iff there are no preferred alternatives. Each replica assigned is considered moved in the move stage. A new flag arg `--sticky` is added to the `generate` command of the existing `kafka-reassign-partitions` tool. If the flag is present, the tool will use the algorithm defined above. If the flag is not present, the tool will default to the existing algorithm. In a rack-unaware setup, the algorithm will find the replica equilibrium. ###### Limitations The proposed algorithm might not find the optimal solution, because improvement scores for properties one and two must not worsen for any move. Therefore, if `> 1` moves would be necessary to not worsen any improvement scores, the algorithm will not find the solution. Consider the example where we have four brokers spread evenly across two racks (two per rack), one topic partition with two replicas spread across rack one and another topic partition with two replicas spread across rack two. The `1) replica balance across brokers for the cluster` score would be optimal in this situation. There is no move for any replica that would not worsen this property. However, the `2) replica balance across racks for a topic partition` property would be suboptimal for both topic partitions. Swapping one replica of the first topic partition for another replica of the second topic partition would improve property `2`, while not worsening property `1`. This optimal chain of moves can not be executed, because the first move in the chain would require the `1` property to worsen. Even though this limitation exists, we still think that the proposed algorithm would be a valid alternative, since the existing one can not fulfill any of the properties above. Moreover, the existing algorithm reassigns replicas randomly which results in unnecessary replica moves in most cases. ###### Testing strategy The properties of the algorithm are unit tested using `jqwik` (property based testing). Following property based tests are defined: - `1) replica balance across brokers for the cluster` never worsens - `2) replica balance across racks for a topic partition` never worsens - if rack unaware, the algorithm will find a replica broker equilibrium Integration into the `kafka-reassign-partitions` is not tested, since the algorithm is stateless and just a replacement for the existing algorithm in the tool. All other aspects of the script, like acquiring cluster metadata, parsing command opts, etc., stay the same. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is my first contribution to the kafka project. If there are things I'm missing I would be more than happy to address them. Any ideas for improvement on the algorithm, testing, or anything related to the PR are appreciated as well. Thanks a lot! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org