[ https://issues.apache.org/jira/browse/SOLR-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pierre Salagnac reassigned SOLR-17405: -------------------------------------- Assignee: Pierre Salagnac > Zookeeper session can be re-established by multiple threads concurrently > ------------------------------------------------------------------------ > > Key: SOLR-17405 > URL: https://issues.apache.org/jira/browse/SOLR-17405 > Project: Solr > Issue Type: Bug > Affects Versions: 8.11, 9.6 > Reporter: Pierre Salagnac > Assignee: Pierre Salagnac > Priority: Major > Labels: pull-request-available > Attachments: stack.png > > Time Spent: 50m > Remaining Estimate: 0h > > Because of a bug in SolrCloud, the Zookeeper session can be re-established by > multiple threads concurrently when an expiration occurs. > This portion of the code assumes it is mono-threaded. Because of the bug, the > last thread re-establishing the session can waif for 30 seconds per core, > waiting for it to be marked {{DOWN}} while it was previously marked > {{ACTIVE}} by another thread. With a high number of cores, the Solr server > can hang for hours before taking traffic again. > Following exception shows two threads were reestablishing the session > concurrently. {{ZkController.createEphemeralLiveNode()}} should never be > invoked twice for the same Zookeeper session. > {code:java} > thrown: java. lang.RuntimeException: > org.apache.solr.common.cloud.ZooKeeperException: > at > org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:178) > at org.apache.solr. common.cloud.DefaultConnectionStrategy. > reconnect(DefaultConnectionStrategy.java:57) > at > org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:152) > at org.apache. > zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:535) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) > Caused by: org.apache.solr.common.cloud.ZooKeeperException: > at > org.apache.solr.cloud.ZkController$1.command(ZkController.java:462) > at > org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:170) > ... 4 more > Caused by: org.apache. zookeeper.KeeperException$NodeExistsException. > KeeperErrorCode = NodeExists > at org.apache.zookeeper.KeeperException.create(KeeperException.java: > 126) > at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1925) > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1830) > at > org.apache.solr.common.cloud.SolrZkClient.lambda$multi$11(SolrZkClient.java:666) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retry0peration(ZkCmdExecutor.java:71) > at > org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:666) > at org.apache.sol.cloud.ZkController > CreateEphemeralLiveNode(ZkController.java:1086) > at > org.apache.solr.cloud.ZkController$1.command(ZkController.java:411) > ... 5 more {code} > h2. Root cause > This bug occurs because several threads can re-establish the session > concurrently. > It cannot happen at the first expiration of the session, thanks to a thread > pool with a single thread to execute the zookeeper Watcher. > Bellow is a code snippet from class {{SolrZkClient.ProcessWatchWithExecutor}} > {code:java} > if (watcher instanceof ConnectionManager) { > zkConnManagerCallbackExecutor.submit(() -> watcher.process(event)); > } else { > ....... > } > {code} > Using this dedicated thread pool (with a single thread) is supposed to ensure > we don’t handle watches for connection related events with multiple threads. > This works well for the first session expiration. > Now, when we re-establish the session after the first expiration, we don’t > use this wrapper to register the watch. > It is done directly in {{ConnectionManager}} without wrapping the ZK watch. > In the following snippet, _“this”_ is the ZK watcher instance, but it is not > wrapper to use a {{{}ProcessWatchWithExecutor{}}}. This means the next events > will directly be handled by any ZK callback thread. > {code:java} > connectionStrategy.reconnect(zkServerAddress,client.getZkClientTimeout(), > this, > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org