[jira] [Comment Edited] (SOLR-17331) MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky

Yohann Callea (Jira) Thu, 13 Jun 2024 02:52:52 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854694#comment-17854694
 ]


Yohann Callea edited comment on SOLR-17331 at 6/13/24 9:51 AM:
---------------------------------------------------------------

I am uncertain why this test case is specifically validating the replicas 
balance after the migration by the way.

The positioning of the moved replicas is entirely reliant on the configured 
replica placement strategy implementation, which, in my opinion, is a bit 
outside of the Migrate Replicas command's scope. I would expect this scenario 
(assign multiple replicas of different shards in a single request and expect a 
balanced outcome) to be tested as part of each of the replica placement 
strategies test suites and only validate in the replica migration tests that 
the vacated nodes are empty and that the priorly assigned replicas are now 
assigned to any of the other nodes.

Additionally, I don't think we can assume that the replica placement strategy 
will necessarily optimize the balance by evening out the number of replicas on 
each node. We could imagine having other optimization objectives (e.g., even 
out disk space usage, balancing according to whatever resources usage per 
replica metric, etc.), that may end up not consider the number of replicas per 
node as the main optimization objective. There can also be placement strategies 
that would make the choice to have a lightweight placement strategy (to 
optimize the initial replica placement speed) and accept that the cluster will 
not be perfectly balanced after the initial positioning, then rely on regular 
replica balancing operations to optimize the placements.

 

My thoughts on this test would be :
 * Remove the *_testGoodSpreadDuringAssignWithNoTarget_* case, as the other 
case from this test suite already validates the migration behavior.
 * Maybe randomize some of the number of nodes, the number of shards, the 
replication factor and the replica placement strategy.
 * Move the "assign multiple replicas of different shards in a single request 
and expect a balanced outcome" scenario to a test case for the Simple placement 
strategy (as I don't think such test exist) and eventually fix the logic that 
causes an imbalanced outcome in the Simple placement strategy if we think this 
is not an expected behavior.


was (Author: JIRAUSER303849):
I am quite uncertain why this test case is specifically validating the replicas 
balance after the migration by the way.

The positioning of the moved replicas is entirely reliant on the configured 
replica placement strategy implementation, which, in my opinion, is a bit 
outside of the Migrate Replicas command's scope. I would expect this scenario 
(assign multiple replicas of different shards in a single request and expect a 
balanced outcome) to be tested as part of each of the replica placement 
strategies test suites and only validate in the replica migration tests that 
the vacated nodes are empty and that the priorly assigned replicas are now 
assigned to any of the other nodes.

Additionally, I don't think we can assume that the replica placement strategy 
will necessarily optimize the balance by evening out the number of replicas on 
each node. We could imagine having other optimization objectives (e.g., even 
out disk space usage, balancing according to whatever resources usage per 
replica metric, etc.), that may end up not consider the number of replicas per 
node as the main optimization objective. There can also be placement strategies 
that would make the choice to have a lightweight placement strategy (to 
optimize the initial replica placement speed) and accept that the cluster will 
not be perfectly balanced after the initial positioning, then rely on regular 
replica balancing operations to optimize the placements.

 

My thoughts on this test would be :
 * Remove the *_testGoodSpreadDuringAssignWithNoTarget_* case, as the other 
case from this test suite already validates the migration behavior.
 * Maybe randomize some of the number of nodes, the number of shards, the 
replication factor and the replica placement strategy.
 * Move the "assign multiple replicas of different shards in a single request 
and expect a balanced outcome" scenario to a test case for the Simple placement 
strategy (as I don't think such test exist) and eventually fix the logic that 
causes an imbalanced outcome in the Simple placement strategy if we think this 
is not an expected behavior.

> MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky
> -------------------------------------------------------------------
>
>                 Key: SOLR-17331
>                 URL: https://issues.apache.org/jira/browse/SOLR-17331
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Yohann Callea
>            Priority: Minor
>
> The test *_MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget_* is 
> sometimes (< 3% failure rate) failing on its last assertion, as shows the 
> [trend history of test 
> failures|#series/org.apache.solr.cloud.MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget].
>  
> This test spins off a 5 nodes cluster, creates a collection with 3 shards and 
> a replication factor of 2.
> It then vacate 2 randomly chosen nodes using the Migrate Replicas command 
> and, after the migration completion, expect the vacated node to be assigned 
> no replicas and the 6 replicas to be evenly spread across the 3 non-vacated 
> nodes (i.e., 2 replicas positioned on each node).
> However, this last assertion happen to fail as the replicas are sometimes not 
> evenly spread over the 3 non-vacated nodes.
> {code:java}
> The non-source node '127.0.0.1:36007_solr' has the wrong number of replicas 
> after the migration expected:<2> but was:<1> {code}
>  
> If we analyse more in detail a failure situation, it appears that this test 
> is inherently expected to fail under some circumstances, given how the 
> Migrate Replicas command operate.
> When migrating replicas, the new position of the replicas to be moved are 
> calculated sequentially and, for every consecutive move, the position is 
> decided according to the logic implemented by the replica placement plugin 
> currently configured.
> We can therefore end up in the following situation.
> h2. Failing scenario
> Note that this test always uses the default replica placement strategy, which 
> is Simple as of today.
> Let's assume the following initial state, after the collection creation.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |         |         |    X    |         |
> SHARD_2 |         |    X    |         |    X    |         |
> SHARD_3 |         |         |    X    |         |    X    | {code}
> The test now runs the migrate command to vacate *_NODE_3_* and 
> {*}_NODE_4_{*}. It therefore needs to go through 3 replica movements for 
> emptying these two nodes.
> h4. Move 1
> We are moving the replica of *_SHARD_1_* positioned on {*}_NODE_3_{*}.
> _*NODE_0*_ is not an eligible destination for this replica as this node is 
> already assigned a replica of {*}_SHARD_1_{*}, and both *_NODE_1_* and 
> _*NODE_2*_ can be chosen as they host the same number of replicas.
> *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination 
> nodes.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |    X    |         |         |         |
> SHARD_2 |         |    X    |         |    X    |         |
> SHARD_3 |         |         |    X    |         |    X    | {code}
> h4. Move 2
> We are moving the replica of *_SHARD_2_* positioned on {*}_NODE_3_{*}.
> _*NODE_1*_ is not an eligible destination for this replica as this node is 
> already assigned a replica of {*}_SHARD_2_{*}, and both *_NODE_0_* and 
> _*NODE_2*_ can be chosen as they host the same number of replicas.
> *_NODE_0_* is arbitrarily chosen amongst the two best candidate destination 
> nodes.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |    X    |         |         |         |
> SHARD_2 |    X    |    X    |         |         |         |
> SHARD_3 |         |         |    X    |         |    X    |{code}
> h4. Move 3
> We are moving the replica of *_SHARD_3_* positioned on {*}_NODE_4_{*}.
> _*NODE_2*_ is not an eligible destination for this replica as this node is 
> already assigned a replica of {*}_SHARD_3_{*}, and both *_NODE_0_* and 
> _*NODE_1*_ can be chosen as they host the same number of replicas.
> *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination 
> nodes.
> {code:java}
>         |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
> --------+---------+---------+---------+---------+---------+
> SHARD_1 |    X    |    X    |         |         |         |
> SHARD_2 |    X    |    X    |         |         |         |
> SHARD_3 |         |    X    |    X    |         |         |{code}
>  
> The test will then fail as the replicas are not evenly positioned across the 
> non-vacated nodes, while it is arguably the expected outcome in the current 
> situation given the Simple placement strategy implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Comment Edited] (SOLR-17331) MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky

Reply via email to