ArneSchulze opened a new pull request, #18903:
URL: https://github.com/apache/kafka/pull/18903

   This PR introduces an alternative algorithm to the one used in the 
`kafka-reassign-partitions` script to minimize the number of replica-movement 
needed to balance out brokers.
   It tries to improve on the following properties in order:
   
   1) replica balance across brokers for the cluster
   2) replica balance across racks for a topic partition
   3) leader balance across brokers for the cluster
   
   The algorithm guarantees that the proposed assignments will never worsen 
properties one and two.
   Property three is considered for ranking changes, but not enforced.
   It will modify the initial assignments in two stages, executed in the 
following order:
   
   1) Assign unassigned replicas
   2) Move already assigned replicas
   
   In the second stage replicas are moved iteratively between brokers until 
there is no possible move for any replica.
   A move is considered possible if at least one of the properties is improved 
while none of the others is worsened.
   
   In each iteration the best possible move is chosen and executed.
   Moves for replicas that have already been moved in a previous iteration are 
preferred to other moves.
   Moving already moved replicas will not increase the overall number of moves 
required (`1 -> 2` + `2 -> 3` reduces to `1 -> 3`).
   Moves are ranked by improvement scores.
   Each move has one improvement score for each of the properties defined above.
   Improvement scores are compared in the order of priority defined above.
   
   An improvement score is computed by subtracting the score of the current 
state from the score of the hypothetical new state after the move was executed.
   The score of a state is computed by a slightly modified sum of residual 
squares computation to avoid floating point arithmetic.
   For example: The score for `1) replica balance across brokers for the 
cluster` is $\sum_{b \in brokers} replicaCount(b)^2$.
   The resulting score will get smaller the smaller the variance of the values.
   
   During the first stage, the algorithm will assign all unassigned replicas.
   An unassigned replica is a replica assigned to a broker not contained in the 
broker metadata.
   Possible initial assignments for each replica will be ranked similarly to 
the ranking while moving replicas.
   Instead of using the improvement score, the initial assignment solely uses 
the score of the new state for each property.
   Therefore, moves that are considered impossible in the move stage can be 
executed in the initial assignment stage iff there are no preferred 
alternatives.
   Each replica assigned is considered moved in the move stage.
   
   A new flag arg `--sticky` is added to the `generate` command of the existing 
`kafka-reassign-partitions` tool.
   If the flag is present, the tool will use the algorithm defined above.
   If the flag is not present, the tool will default to the existing algorithm.
   
   In a rack-unaware setup, the algorithm will find the replica equilibrium.
   
   ###### Limitations
   
   The proposed algorithm might not find the optimal solution, because 
improvement scores for properties one and two must not worsen for any move.
   Therefore, if `> 1` moves would be necessary to not worsen any improvement 
scores, the algorithm will not find the solution.
   
   Consider the example where we have four brokers spread evenly across two 
racks (two per rack), one topic partition with two replicas
   spread across rack one and another topic partition with two replicas spread 
across rack two.
   The `1) replica balance across brokers for the cluster` score would be 
optimal in this situation.
   There is no move for any replica that would not worsen this property.
   However, the `2) replica balance across racks for a topic partition` 
property would be suboptimal for both topic partitions.
   Swapping one replica of the first topic partition for another replica of the 
second topic partition would improve property `2`, while
   not worsening property `1`.
   This optimal chain of moves can not be executed, because the first move in 
the chain would require the `1` property to worsen.
   
   Even though this limitation exists, we still think that the proposed 
algorithm would be a valid alternative, since the existing one can not fulfill 
any of
   the properties above.
   Moreover, the existing algorithm reassigns replicas randomly which results 
in unnecessary replica moves in most cases.
   
   ###### Testing strategy
   
   The properties of the algorithm are unit tested using `jqwik` (property 
based testing). Following property based tests are defined:
   
   - `1) replica balance across brokers for the cluster` never worsens
   - `2) replica balance across racks for a topic partition` never worsens
   - if rack unaware, the algorithm will find a replica broker equilibrium
   
   Integration into the `kafka-reassign-partitions` is not tested, since the 
algorithm is stateless and just a replacement for the existing algorithm in the 
tool.
   All other aspects of the script, like acquiring cluster metadata, parsing 
command opts, etc., stay the same.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
   This is my first contribution to the kafka project.
   If there are things I'm missing I would be more than happy to address them.
   Any ideas for improvement on the algorithm, testing, or anything related to 
the PR are appreciated as well. Thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to