[ 
https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174420#comment-14174420
 ] 

Jessica Cheng Mallet commented on SOLR-6336:
--------------------------------------------

Please let me know if I'm supposed to open a new issue (not sure what the 
policy is).

I'm encountering a bug from this patch: I'm now stuck in a loop making 
getChildren() requests to ZooKeeper, with this thread dump:
{quote}
Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
java.lang.Object.wait()
org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, 
ZooKeeper$WatchRegistration)
org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
org.apache.solr.common.cloud.SolrZkClient$6.execute()<2 recursive calls>
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
org.apache.solr.cloud.DistributedQueue.getChildren(long)
org.apache.solr.cloud.DistributedQueue.peek(long)
org.apache.solr.cloud.DistributedQueue.peek(boolean)
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
java.lang.Thread.run()
{quote}

Looking at the code, I think the issue is that LatchChildWatcher#process always 
stores the event in its member, regardless of the event's type, and once an 
event is set, await no longer waits. In this state, the while loop in 
getChildren(long), when called with wait being Long.MAX_VALUE, will come back 
around, NOT wait at await because event != null, and then still not get any 
children, so it loops again right away.

{quote}
    while (true) {
      if (!children.isEmpty()) break;
      watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
      if (watcher.getWatchedEvent() != null) {
        children = orderedChildren(null);
      }
      if (wait != Long.MAX_VALUE) break;
    }
{quote}

I think the fix would be to only set the event in the watcher if its type is 
NodeChildrenChanged.
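
To make the proposed fix concrete, here is a minimal sketch of a latch-style watcher that ignores every event type except NodeChildrenChanged. It is not the actual Solr code: the class name FilteringLatchWatcher and the local EventType enum are hypothetical stand-ins (the enum mimics ZooKeeper's Watcher.Event.EventType so the example compiles without the ZooKeeper jar), and the real watcher would receive a WatchedEvent rather than a bare type.

```java
// Sketch only: a simplified, hypothetical watcher, not the Solr implementation.
class FilteringLatchWatcher {

    // Hypothetical stand-in for org.apache.zookeeper.Watcher.Event.EventType.
    enum EventType { None, NodeCreated, NodeDeleted, NodeDataChanged, NodeChildrenChanged }

    private final Object lock = new Object();
    private EventType event; // stays null until a NodeChildrenChanged arrives

    // Record the event only if it signals a change in the children.
    // Other types (for example None on a connection state change) are
    // dropped, so they cannot turn await() into a no-op.
    void process(EventType type) {
        if (type != EventType.NodeChildrenChanged) {
            return;
        }
        synchronized (lock) {
            event = type;
            lock.notifyAll();
        }
    }

    // Return immediately if a relevant event was already recorded;
    // otherwise block for up to timeoutMillis (must be > 0, since
    // Object.wait(0) would wait indefinitely). A single wait() may also
    // return early on a spurious wakeup; callers re-check in a loop.
    void await(long timeoutMillis) throws InterruptedException {
        synchronized (lock) {
            if (event != null) {
                return;
            }
            lock.wait(timeoutMillis);
        }
    }

    EventType getWatchedEvent() {
        synchronized (lock) {
            return event;
        }
    }
}
```

With a watcher like this, an unrelated event leaves getWatchedEvent() null, so the loop in getChildren(long) would block in await for the timeout on each pass instead of spinning.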

> DistributedQueue (and it's use in OCP) leaks ZK Watches
> -------------------------------------------------------
>
>                 Key: SOLR-6336
>                 URL: https://issues.apache.org/jira/browse/SOLR-6336
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Ramkumar Aiyengar
>            Assignee: Mark Miller
>             Fix For: 4.10, Trunk
>
>
> The current {{DistributedQueue}} implementation leaks ZK watches whenever it 
> finds children or times out on finding one. OCP uses this in its event loop 
> and can loop tight in some conditions (when exclusivity checks fail), leading 
> to lots of watches which get triggered together on the next event (could be a 
> while for some activities like shard splitting).
> This gets exposed by SOLR-6261 which spawns a new thread for every parallel 
> watch event.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
