[ https://issues.apache.org/jira/browse/SOLR-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854694#comment-17854694 ]
Yohann Callea edited comment on SOLR-17331 at 6/13/24 9:51 AM: --------------------------------------------------------------- I am uncertain why this test case is specifically validating the replicas balance after the migration by the way. The positioning of the moved replicas is entirely reliant on the configured replica placement strategy implementation, which, in my opinion, is a bit outside of the Migrate Replicas command's scope. I would expect this scenario (assign multiple replicas of different shards in a single request and expect a balanced outcome) to be tested as part of each of the replica placement strategies test suites and only validate in the replica migration tests that the vacated nodes are empty and that the priorly assigned replicas are now assigned to any of the other nodes. Additionally, I don't think we can assume that the replica placement strategy will necessarily optimize the balance by evening out the number of replicas on each node. We could imagine having other optimization objectives (e.g., even out disk space usage, balancing according to whatever resources usage per replica metric, etc.), that may end up not consider the number of replicas per node as the main optimization objective. There can also be placement strategies that would make the choice to have a lightweight placement strategy (to optimize the initial replica placement speed) and accept that the cluster will not be perfectly balanced after the initial positioning, then rely on regular replica balancing operations to optimize the placements. My thoughts on this test would be : * Remove the *_testGoodSpreadDuringAssignWithNoTarget_* case, as the other case from this test suite already validates the migration behavior. * Maybe randomize some of the number of nodes, the number of shards, the replication factor and the replica placement strategy. * Move the "assign multiple replicas of different shards in a single request and expect a balanced outcome" scenario to a test case for the Simple placement strategy (as I don't think such test exist) and eventually fix the logic that causes an imbalanced outcome in the Simple placement strategy if we think this is not an expected behavior. was (Author: JIRAUSER303849): I am quite uncertain why this test case is specifically validating the replicas balance after the migration by the way. The positioning of the moved replicas is entirely reliant on the configured replica placement strategy implementation, which, in my opinion, is a bit outside of the Migrate Replicas command's scope. I would expect this scenario (assign multiple replicas of different shards in a single request and expect a balanced outcome) to be tested as part of each of the replica placement strategies test suites and only validate in the replica migration tests that the vacated nodes are empty and that the priorly assigned replicas are now assigned to any of the other nodes. Additionally, I don't think we can assume that the replica placement strategy will necessarily optimize the balance by evening out the number of replicas on each node. We could imagine having other optimization objectives (e.g., even out disk space usage, balancing according to whatever resources usage per replica metric, etc.), that may end up not consider the number of replicas per node as the main optimization objective. There can also be placement strategies that would make the choice to have a lightweight placement strategy (to optimize the initial replica placement speed) and accept that the cluster will not be perfectly balanced after the initial positioning, then rely on regular replica balancing operations to optimize the placements. My thoughts on this test would be : * Remove the *_testGoodSpreadDuringAssignWithNoTarget_* case, as the other case from this test suite already validates the migration behavior. * Maybe randomize some of the number of nodes, the number of shards, the replication factor and the replica placement strategy. * Move the "assign multiple replicas of different shards in a single request and expect a balanced outcome" scenario to a test case for the Simple placement strategy (as I don't think such test exist) and eventually fix the logic that causes an imbalanced outcome in the Simple placement strategy if we think this is not an expected behavior. > MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky > ------------------------------------------------------------------- > > Key: SOLR-17331 > URL: https://issues.apache.org/jira/browse/SOLR-17331 > Project: Solr > Issue Type: Test > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud > Reporter: Yohann Callea > Priority: Minor > > The test *_MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget_* is > sometimes (< 3% failure rate) failing on its last assertion, as shows the > [trend history of test > failures|#series/org.apache.solr.cloud.MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget]. > > This test spins off a 5 nodes cluster, creates a collection with 3 shards and > a replication factor of 2. > It then vacate 2 randomly chosen nodes using the Migrate Replicas command > and, after the migration completion, expect the vacated node to be assigned > no replicas and the 6 replicas to be evenly spread across the 3 non-vacated > nodes (i.e., 2 replicas positioned on each node). > However, this last assertion happen to fail as the replicas are sometimes not > evenly spread over the 3 non-vacated nodes. > {code:java} > The non-source node '127.0.0.1:36007_solr' has the wrong number of replicas > after the migration expected:<2> but was:<1> {code} > > If we analyse more in detail a failure situation, it appears that this test > is inherently expected to fail under some circumstances, given how the > Migrate Replicas command operate. > When migrating replicas, the new position of the replicas to be moved are > calculated sequentially and, for every consecutive move, the position is > decided according to the logic implemented by the replica placement plugin > currently configured. > We can therefore end up in the following situation. > h2. Failing scenario > Note that this test always uses the default replica placement strategy, which > is Simple as of today. > Let's assume the following initial state, after the collection creation. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | | | X | | > SHARD_2 | | X | | X | | > SHARD_3 | | | X | | X | {code} > The test now runs the migrate command to vacate *_NODE_3_* and > {*}_NODE_4_{*}. It therefore needs to go through 3 replica movements for > emptying these two nodes. > h4. Move 1 > We are moving the replica of *_SHARD_1_* positioned on {*}_NODE_3_{*}. > _*NODE_0*_ is not an eligible destination for this replica as this node is > already assigned a replica of {*}_SHARD_1_{*}, and both *_NODE_1_* and > _*NODE_2*_ can be chosen as they host the same number of replicas. > *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination > nodes. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | X | | | | > SHARD_2 | | X | | X | | > SHARD_3 | | | X | | X | {code} > h4. Move 2 > We are moving the replica of *_SHARD_2_* positioned on {*}_NODE_3_{*}. > _*NODE_1*_ is not an eligible destination for this replica as this node is > already assigned a replica of {*}_SHARD_2_{*}, and both *_NODE_0_* and > _*NODE_2*_ can be chosen as they host the same number of replicas. > *_NODE_0_* is arbitrarily chosen amongst the two best candidate destination > nodes. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | X | | | | > SHARD_2 | X | X | | | | > SHARD_3 | | | X | | X |{code} > h4. Move 3 > We are moving the replica of *_SHARD_3_* positioned on {*}_NODE_4_{*}. > _*NODE_2*_ is not an eligible destination for this replica as this node is > already assigned a replica of {*}_SHARD_3_{*}, and both *_NODE_0_* and > _*NODE_1*_ can be chosen as they host the same number of replicas. > *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination > nodes. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | X | | | | > SHARD_2 | X | X | | | | > SHARD_3 | | X | X | | |{code} > > The test will then fail as the replicas are not evenly positioned across the > non-vacated nodes, while it is arguably the expected outcome in the current > situation given the Simple placement strategy implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org