[
https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man reopened SOLR-13445:
-----------------------------
jenkins has found at least 2 problems with the new
RoutingToNodesWithPropertiesTest class...
[https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-8.x-Linux/536/]
----
First: a reproducing failing seed (on branch_8x)...
{noformat}
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=RoutingToNodesWithPropertiesTest -Dtests.method=test
-Dtests.seed=13525A4073A0EB3F -Dtests.multiplier=3 -Dtests.slow=true
-Dtests.locale=zh-HK -Dtests.timezone=Brazil/Acre -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
[junit4] FAILURE 0.45s J1 | RoutingToNodesWithPropertiesTest.test <<<
[junit4] > Throwable #1: java.lang.AssertionError: Hitting same zone
after 10 queries
[junit4] > at
__randomizedtesting.SeedInfo.seed([13525A4073A0EB3F:9B06659ADD5C86C7]:0)
[junit4] > at
org.apache.solr.cloud.RoutingToNodesWithPropertiesTest.test(RoutingToNodesWithPropertiesTest.java:251)
[junit4] > at java.lang.Thread.run(Thread.java:748)
{noformat}
At a glance, the problem seems to be that the test assumes that if it tries a
query 10 times, at least one of those queries will hit 2 nodes in different
"zones" – but there's no guarantee of that, it's pure dumb luck – it's like
having a test that calls {{random().nextInt(2)}} in a loop 10 times and asserts
that it got a value of "0" on at least one iteration ... it's statistically
going to fail some fixed percentage of the time.
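To put a number on that analogy, here is a minimal sketch. It is not the test's code: the class and method names are made up for illustration, and the real test's odds depend on the cluster layout and routing, which I haven't worked out. For the {{nextInt(2)}} case, all 10 draws miss "0" with probability (1/2)^10, i.e. 1 run in 1024.

```java
import java.util.Random;

// Hypothetical illustration only; none of these names exist in Solr.
public class FlakyOdds {

    // Probability that `attempts` independent draws, each uniform over
    // `choices` values, never produce the one wanted value.
    static double failureProbability(int choices, int attempts) {
        return Math.pow(1.0 - 1.0 / choices, attempts);
    }

    // Mirrors the analogy: true if nextInt(2) returned 0 at least once.
    static boolean sawZero(Random random, int attempts) {
        for (int i = 0; i < attempts; i++) {
            if (random.nextInt(2) == 0) {
                return true;
            }
        }
        return false;
    }

    // Counts how many of `runs` simulated test runs would have failed.
    static int countFailures(long seed, int runs) {
        Random random = new Random(seed);
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            if (!sawZero(random, 10)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("analytic failure rate: " + failureProbability(2, 10));
        System.out.println("simulated failures in 1,000,000 runs: "
            + countFailures(42L, 1_000_000));
    }
}
```

The point is only that a fixed retry count gives a small but non-zero failure rate, so across enough randomized jenkins runs some seed will eventually trip the assertion.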
----
Second: when jenkins tries to reproduce the seed, it runs with
{{-Dtests.dups=5}} but this causes an initialization failure in the BeforeClass
method ... I'm not certain, but at a glance I'm guessing this is because of
static variables that aren't being cleaned up in the AfterClass method?
{noformat}
[junit4] ERROR 0.00s J2 | RoutingToNodesWithPropertiesTest (suite) <<<
[junit4] > Throwable #1: java.lang.AssertionError: expected:<us-west1>
but was:<null>
[junit4] > at
__randomizedtesting.SeedInfo.seed([13525A4073A0EB3F]:0)
[junit4] > at
org.apache.solr.cloud.RoutingToNodesWithPropertiesTest.setupCluster(RoutingToNodesWithPropertiesTest.java:115)
[junit4] > at java.lang.Thread.run(Thread.java:748)
{noformat}
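To illustrate the kind of static-state leak I'm guessing at, here is a sketch in plain Java, without JUnit. It is an assumption, not a diagnosis: the field names, the "initialize once" guard, and the cleanup logic are all hypothetical and not taken from RoutingToNodesWithPropertiesTest. It only shows how a second suite run in the same JVM can see {{null}} when cleanup resets some static fields but not others.

```java
// Hypothetical illustration only; not the actual test class.
public class StaticLeakSketch {

    // Suite-level static state (assumed for the example).
    static String zone;
    static boolean initialized;

    // Stand-in for a @BeforeClass method that only initializes once per JVM.
    static void beforeClass() {
        if (!initialized) {
            zone = "us-west1";
            initialized = true;
        }
    }

    // Stand-in for an @AfterClass method. With resetFlag == false the cleanup
    // is incomplete: the value is cleared but the guard flag is left set.
    static void afterClass(boolean resetFlag) {
        zone = null;
        if (resetFlag) {
            initialized = false;
        }
    }

    // Runs the "suite" twice in one JVM and returns what the second run sees.
    static String zoneOnSecondRun(boolean resetFlag) {
        zone = null;
        initialized = false;
        beforeClass();
        afterClass(resetFlag);
        beforeClass();
        return zone;
    }

    public static void main(String[] args) {
        System.out.println("incomplete cleanup: " + zoneOnSecondRun(false));
        System.out.println("full cleanup: " + zoneOnSecondRun(true));
    }
}
```

If something like this is going on, resetting every static field in the AfterClass method should let the suite run repeatedly in one JVM.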
> Preferred replicas on nodes with same system properties as the query master
> ---------------------------------------------------------------------------
>
> Key: SOLR-13445
> URL: https://issues.apache.org/jira/browse/SOLR-13445
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Assignee: Cao Manh Dat
> Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: SOLR-13445.patch, SOLR-13445.patch, SOLR-13445.patch
>
>
> Currently, Solr chooses a random replica for each shard to fan out the query
> request. However, this presents a problem when running Solr in multiple
> availability zones.
> If one availability zone fails then it affects all Solr nodes because they
> will try to connect to Solr nodes in the failed availability zone until the
> request times out. This can lead to a build up of threads on each Solr node
> until the node goes out of memory. This results in a cascading failure.
> This issue tries to solve this problem by adding
> * another shardPreference param named {{node.sysprop}}, so the query will be
> routed to nodes with the same defined system properties as the current one.
> * default shardPreferences on the whole cluster, which will be stored in
> {{/clusterprops.json}}.
> * a cacher that fetches other nodes' system properties whenever /live_nodes
> changes.