[ https://issues.apache.org/jira/browse/SOLR-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yohann Callea updated SOLR-17331:
---------------------------------
    Description: 
The test *_MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget_* sometimes fails (< 3% failure rate) on its last assertion, as shown by the [trend history of test failures|#series/org.apache.solr.cloud.MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget].

 

This test spins up a 5-node cluster and creates a collection with 3 shards and a replication factor of 2.

It then vacates 2 randomly chosen nodes using the Migrate Replicas command and, once the migration completes, expects the vacated nodes to hold no replicas and the 6 replicas to be evenly spread across the 3 non-vacated nodes (i.e., 2 replicas on each node).

However, this last assertion sometimes fails, as the replicas are not always evenly spread over the 3 non-vacated nodes.
{code:java}
The non-source node '127.0.0.1:36007_solr' has the wrong number of replicas 
after the migration expected:<2> but was:<1> {code}
 

Analysing a failure in more detail shows that this test is inherently expected to fail under some circumstances, given how the Migrate Replicas command operates.

When migrating replicas, the new positions of the replicas to be moved are computed sequentially and, for each consecutive move, the destination is decided by the logic of the currently configured replica placement plugin.
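As a hedged illustration of that per-move logic, here is a minimal sketch (not Solr's actual plugin API; the node map, the {{pickDestination}} helper, and the tie-breaking by encounter order are all assumptions): each move filters out nodes that already host a replica of the shard being moved and then takes any least-loaded remaining node.

```java
import java.util.*;

public class PickDestinationSketch {
    // Hypothetical helper: pick a destination node for one replica move.
    // Nodes already hosting a replica of the shard are ineligible; among
    // the rest, a least-loaded node is returned (ties broken arbitrarily,
    // here simply by encounter order over the map).
    static String pickDestination(Map<String, Set<String>> replicasByNode, String shard) {
        return replicasByNode.entrySet().stream()
                .filter(e -> !e.getValue().contains(shard))  // eligibility check
                .min(Comparator.comparingInt(
                        (Map.Entry<String, Set<String>> e) -> e.getValue().size()))
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // One move in isolation: NODE_0 already hosts SHARD_1, so only
        // NODE_1 and NODE_2 are eligible, and both are equally loaded.
        Map<String, Set<String>> nodes = new LinkedHashMap<>();
        nodes.put("NODE_0", Set.of("SHARD_1"));
        nodes.put("NODE_1", Set.of("SHARD_2"));
        nodes.put("NODE_2", Set.of("SHARD_3"));
        System.out.println(pickDestination(nodes, "SHARD_1"));
    }
}
```

Either NODE_1 or NODE_2 is a valid answer for this move; that arbitrariness in tie-breaking is what drives the scenario below.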

We can therefore end up in the following situation.
h2. Failing scenario

Note that this test always uses the default replica placement strategy, which is currently Simple.

Let's assume the following initial state, after the collection creation.
{code:java}
        |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |         |         |    X    |         |
SHARD_2 |         |    X    |         |    X    |         |
SHARD_3 |         |         |    X    |         |    X    | {code}
The test now runs the migrate command to vacate *_NODE_3_* and {*}_NODE_4_{*}. It therefore needs to perform 3 replica movements to empty these two nodes.
h4. Move 1

We are moving the replica of *_SHARD_1_* positioned on {*}_NODE_3_{*}.

_*NODE_0*_ is not an eligible destination for this replica as this node is 
already assigned a replica of {*}_SHARD_1_{*}, and both *_NODE_1_* and 
_*NODE_2*_ can be chosen as they host the same number of replicas.

*_NODE_1_* is arbitrarily chosen amongst the two best candidate destination 
nodes.
{code:java}
        |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |    X    |         |         |         |
SHARD_2 |         |    X    |         |    X    |         |
SHARD_3 |         |         |    X    |         |    X    | {code}
h4. Move 2

We are moving the replica of *_SHARD_2_* positioned on {*}_NODE_3_{*}.

_*NODE_1*_ is not an eligible destination for this replica as this node is 
already assigned a replica of {*}_SHARD_2_{*}, and both *_NODE_0_* and 
_*NODE_2*_ can be chosen as they host the same number of replicas.

*_NODE_0_* is arbitrarily chosen amongst the two best candidate destination 
nodes.
{code:java}
        |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |    X    |         |         |         |
SHARD_2 |    X    |    X    |         |         |         |
SHARD_3 |         |         |    X    |         |    X    |{code}
h4. Move 3

We are moving the replica of *_SHARD_3_* positioned on {*}_NODE_4_{*}.

_*NODE_2*_ is not an eligible destination for this replica as this node is 
already assigned a replica of {*}_SHARD_3_{*}, and both *_NODE_0_* and 
_*NODE_1*_ can be chosen as they host the same number of replicas.

*_NODE_1_* is arbitrarily chosen amongst the two best candidate destination 
nodes.
{code:java}
        |  NODE_0 |  NODE_1 |  NODE_2 |  NODE_3 |  NODE_4 |
--------+---------+---------+---------+---------+---------+
SHARD_1 |    X    |    X    |         |         |         |
SHARD_2 |    X    |    X    |         |         |         |
SHARD_3 |         |    X    |    X    |         |         |{code}
 

The test then fails because the replicas are not evenly spread across the non-vacated nodes, even though this is arguably the expected outcome in this situation, given the Simple placement strategy's implementation.
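The three moves above can be replayed with a small simulation (same caveat as before: this is a hypothetical greedy placement sketch, not Solr's Simple placement code; the node map, the {{pick}} helper, and the encounter-order tie-breaking are assumptions standing in for "arbitrarily chosen"):

```java
import java.util.*;

public class MigrateSpreadSimulation {
    // Hypothetical greedy placement: a least-loaded eligible node wins,
    // ties broken by encounter order over the map.
    static String pick(Map<String, Set<String>> nodes, String shard) {
        return nodes.entrySet().stream()
                .filter(e -> !e.getValue().contains(shard))
                .min(Comparator.comparingInt(
                        (Map.Entry<String, Set<String>> e) -> e.getValue().size()))
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Non-vacated nodes with their replicas right after collection creation.
        Map<String, Set<String>> nodes = new LinkedHashMap<>();
        nodes.put("NODE_0", new HashSet<>(List.of("SHARD_1")));
        nodes.put("NODE_1", new HashSet<>(List.of("SHARD_2")));
        nodes.put("NODE_2", new HashSet<>(List.of("SHARD_3")));

        // The three replicas evacuated from NODE_3 and NODE_4, placed one at a time.
        for (String shard : List.of("SHARD_1", "SHARD_2", "SHARD_3")) {
            nodes.get(pick(nodes, shard)).add(shard);
        }

        // The final spread is uneven (a 3/2/1 split), so the test's
        // "2 replicas per node" assertion would fail on this run.
        nodes.forEach((node, replicas) -> System.out.println(node + "=" + replicas.size()));
    }
}
```

This run mirrors Moves 1 and 2 of the walkthrough; in Move 3 the sketch's tie-break lands on NODE_0 rather than NODE_1, but the spread ends up uneven either way. Conversely, had Move 1 arbitrarily picked NODE_2 instead of NODE_1, every later move would have had a strict minimum and the final spread would have been an even 2/2/2, which is why the test only fails for some of the arbitrary choices.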



> MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky
> -------------------------------------------------------------------
>
>                 Key: SOLR-17331
>                 URL: https://issues.apache.org/jira/browse/SOLR-17331
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Yohann Callea
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
