[ 
https://issues.apache.org/jira/browse/SOLR-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028782#comment-18028782
 ] 

Jason Gerlowski commented on SOLR-6213:
---------------------------------------

I saw this recently on Solr 9.7 during a period of ZooKeeper misconfiguration.
The stack trace looks a bit different from the one in the description, so I'm
sharing it below:

{code}
o.a.s.c.LeaderElector Failed setting watch => org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /collections/collNameRedacted/leader_elect/shard1/election/7423158284677299411-core_node9-n_0000000118
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /collections/collNameRedacted/leader_elect/shard1/election/7423158284677299411-core_node9-n_0000000118
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) ~[?:?]
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:53) ~[?:?]
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972) ~[?:?]
  at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$6(SolrZkClient.java:448) ~[?:?]
  at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:70) ~[?:?]
  at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:448) ~[?:?]
  at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:164) ~[?:?]
  at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:164) ~[?:?]
  at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:164) ~[?:?]
  .....<300-ish identical lines removed for brevity>
  at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:164) ~[?:?]
  at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:311) ~[?:?]
  at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1597) ~[?:?]
  at org.apache.solr.cloud.ZkController.register(ZkController.java:1303) ~[?:?]
  at org.apache.solr.cloud.ZkController.register(ZkController.java:1237) ~[?:?]
  at org.apache.solr.core.ZkContainer.lambda$registerInZk$1(ZkContainer.java:218) ~[?:?]
  at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449) ~[?:?]
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
  at java.base/java.lang.Thread.run(Unknown Source) [?:?]
{code}

The root cause here seems hard to eliminate entirely: we probably need to keep
retrying replica registration through transient failures such as network
issues, so that Solr can get healthy again on its own once the underlying
problem resolves.  But there's a lot of room to change how the retries happen
so that they can't end in a StackOverflowError: looping instead of recursing,
adding a backoff between retries after the first, say, 100 attempts, etc.
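
To make the looping idea concrete, here is a minimal sketch.  It is not taken
from the Solr codebase: the IterativeRetry class, the retry method, and all of
its parameters (freeAttempts, initialBackoffMs, maxBackoffMs, maxAttempts) are
hypothetical names chosen for illustration, and how such a loop would fit into
LeaderElector is left open.  The point is only that each retry runs at the same
stack depth, and that retries past a threshold sleep with a capped, doubling
backoff.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not a Solr class): retries in a loop instead of by recursion. */
final class IterativeRetry {
  private IterativeRetry() {}

  /**
   * Calls op until it succeeds or maxAttempts is reached. The first
   * freeAttempts - 1 failures are retried immediately; later failures sleep
   * first, doubling the delay each time up to maxBackoffMs.
   */
  static <T> T retry(Callable<T> op, int freeAttempts, long initialBackoffMs,
                     long maxBackoffMs, int maxAttempts) throws Exception {
    long backoffMs = initialBackoffMs;
    for (int attempt = 1; ; attempt++) {
      try {
        // Every attempt is made from this same frame, so the stack never grows.
        return op.call();
      } catch (InterruptedException e) {
        // Do not retry on interruption: restore the flag and give up.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        if (attempt >= maxAttempts) {
          throw e; // out of attempts, surface the last failure
        }
        if (attempt >= freeAttempts) {
          Thread.sleep(backoffMs);
          backoffMs = Math.min(maxBackoffMs, backoffMs * 2);
        }
      }
    }
  }
}
```

A caller that wants to keep trying until the problem clears could pass
Integer.MAX_VALUE for maxAttempts; the backoff cap then bounds how hard a
persistently failing node hits ZooKeeper.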

> StackOverflowException in Solr cloud's leader election
> ------------------------------------------------------
>
>                 Key: SOLR-6213
>                 URL: https://issues.apache.org/jira/browse/SOLR-6213
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10, 6.0
>            Reporter: Dawid Weiss
>            Priority: Critical
>         Attachments: stackoverflow.txt
>
>
> This is what's causing test hangs (at least on FreeBSD, LUCENE-5786), 
> possibly on other machines too. The problem is stack overflow from looped 
> calls in:
> {code}
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:212)
>   > 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>   > 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:313)
>   > org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:221)
>   > 
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:448)
> {code}
> These routines attempt to log information to loggers, which in turn attempt 
> to serialize messages back to the master (test process). When the stack is 
> exhausted, the serialization process fails and breaks the communication with 
> the master test node.


