[ https://issues.apache.org/jira/browse/SOLR-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Houston Putman reassigned SOLR-17331: ------------------------------------- Assignee: Houston Putman > MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget is flaky > ------------------------------------------------------------------- > > Key: SOLR-17331 > URL: https://issues.apache.org/jira/browse/SOLR-17331 > Project: Solr > Issue Type: Test > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud > Reporter: Yohann Callea > Assignee: Houston Putman > Priority: Minor > > The test *_MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget_* is > sometimes (< 3% failure rate) failing on its last assertion, as shows the > [trend history of test > failures|#series/org.apache.solr.cloud.MigrateReplicasTest.testGoodSpreadDuringAssignWithNoTarget]. > > This test spins off a 5 nodes cluster, creates a collection with 3 shards and > a replication factor of 2. > It then vacate 2 randomly chosen nodes using the Migrate Replicas command > and, after the migration completion, expect the vacated node to be assigned > no replicas and the 6 replicas to be evenly spread across the 3 non-vacated > nodes (i.e., 2 replicas positioned on each node). > However, this last assertion happen to fail as the replicas are sometimes not > evenly spread over the 3 non-vacated nodes. > {code:java} > The non-source node '127.0.0.1:36007_solr' has the wrong number of replicas > after the migration expected:<2> but was:<1> {code} > > If we analyse more in detail a failure situation, it appears that this test > is inherently expected to fail under some circumstances, given how the > Migrate Replicas command operate. > When migrating replicas, the new position of the replicas to be moved are > calculated sequentially and, for every consecutive move, the position is > decided according to the logic implemented by the replica placement plugin > currently configured. > We can therefore end up in the following situation. > h2. Failing scenario > Note that this test always uses the default replica placement strategy, which > is Simple as of today. > Let's assume the following initial state, after the collection creation. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | | | X | | > SHARD_2 | | X | | X | | > SHARD_3 | | | X | | X | {code} > The test now runs the migrate command to vacate *_NODE_3_* and > {*}_NODE_4_{*}. It therefore needs to go through 3 replica movements for > emptying these two nodes. > h4. Move 1 > We are moving the replica of *_SHARD_1_* positioned on {*}_NODE_3_{*}. > _*NODE_0*_ is not an eligible destination for this replica as this node is > already assigned a replica of {*}_SHARD_1_{*}, and both *_NODE_1_* and > _*NODE_2*_ can be chosen as they host the same number of replicas. > *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination > nodes. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | X | | | | > SHARD_2 | | X | | X | | > SHARD_3 | | | X | | X | {code} > h4. Move 2 > We are moving the replica of *_SHARD_2_* positioned on {*}_NODE_3_{*}. > _*NODE_1*_ is not an eligible destination for this replica as this node is > already assigned a replica of {*}_SHARD_2_{*}, and both *_NODE_0_* and > _*NODE_2*_ can be chosen as they host the same number of replicas. > *_NODE_0_* is arbitrarily chosen amongst the two best candidate destination > nodes. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | X | | | | > SHARD_2 | X | X | | | | > SHARD_3 | | | X | | X |{code} > h4. Move 3 > We are moving the replica of *_SHARD_3_* positioned on {*}_NODE_4_{*}. > _*NODE_2*_ is not an eligible destination for this replica as this node is > already assigned a replica of {*}_SHARD_3_{*}, and both *_NODE_0_* and > _*NODE_1*_ can be chosen as they host the same number of replicas. > *_NODE_1_* is arbitrarily chosen amongst the two best candidate destination > nodes. > {code:java} > | NODE_0 | NODE_1 | NODE_2 | NODE_3 | NODE_4 | > --------+---------+---------+---------+---------+---------+ > SHARD_1 | X | X | | | | > SHARD_2 | X | X | | | | > SHARD_3 | | X | X | | |{code} > > The test will then fail as the replicas are not evenly positioned across the > non-vacated nodes, while it is arguably the expected outcome in the current > situation given the Simple placement strategy implementation. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org