[ https://issues.apache.org/jira/browse/SOLR-16412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621338#comment-17621338 ]
Kevin Risden commented on SOLR-16412:
-------------------------------------

SOLR-16175 has a branch, but not a PR, with this commit:
https://github.com/apache/solr/commit/c1ae998e7ad3650229449ff7b2a55ef222ec8b8c

> Race condition could trigger error on concurrent SizeLimitedDistributedMap cleanup
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-16412
>                 URL: https://issues.apache.org/jira/browse/SOLR-16412
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 8.8, main (10.0)
>            Reporter: Patson Luk
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> h2. Description
> The exception below is observed while updating the `completedMap` field in `OverseerTaskProcessor`:
> {{o.a.s.c.OverseerTaskProcessor :org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /overseer/collection-map-completed/mn-736f6c726d616e2d312d31383930383730393837313333303932353331}}
> {{at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)}}
> {{at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)}}
> {{at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)}}
> {{at org.apache.solr.common.cloud.SolrZkClient.lambda$delete$1(SolrZkClient.java:264)}}
> {{at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)}}
> {{at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:263)}}
> {{at org.apache.solr.cloud.SizeLimitedDistributedMap.put(SizeLimitedDistributedMap.java:76)}}
> {{at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:538)}}
> {{at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)}}
> {{at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
> {{at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}
>
> h2. Cause
> Based on the stack trace, `SizeLimitedDistributedMap` had reached the limit and attempted to clean up entries:
> [https://github.com/fullstorydev/lucene-solr/blob/75e89929eb360b513ee864aeb23a80c049747246/solr/core/src/java/org/apache/solr/cloud/SizeLimitedDistributedMap.java#L73-L80]
> However, when it performed the actual deletion, it failed with `NoNodeException`.
> This is likely caused by a race condition: multiple threads can enter the same code block and try to delete the same list of children, so the slower threads can attempt to delete a child node that no longer exists.
>
> Such a condition can be reproduced by a unit test case, which will be included in the PR.
>
> h2. Solution
> Although we could enforce synchronization to prevent threads from purging the same set of child nodes, it might not be desirable to add extra blocking.
> Instead, it's probably safe to ignore the `KeeperException.NoNodeException` if such a node is no longer there for the purge operation.
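
For illustration only, the approach proposed under "Solution" can be sketched as below. This is not the code from the linked commit or from `SizeLimitedDistributedMap`: `PurgeSketch`, `FakeStore`, and the local `NoNodeException` are hypothetical stand-ins for the ZooKeeper client and `KeeperException.NoNodeException`, used so the sketch runs without a ZooKeeper dependency. It shows several threads purging the same list of children, with "node already gone" treated as success rather than as an error.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PurgeSketch {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for the child nodes kept in ZooKeeper. */
    static class FakeStore {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();

        void create(String path) {
            nodes.add(path);
        }

        /** Like a ZooKeeper delete: fails if the node does not exist. */
        void delete(String path) throws NoNodeException {
            if (!nodes.remove(path)) {
                throw new NoNodeException(path);
            }
        }

        int size() {
            return nodes.size();
        }
    }

    /** Deletes each path and returns how many deletes this caller performed. */
    static int purge(FakeStore store, List<String> paths) {
        int deleted = 0;
        for (String path : paths) {
            try {
                store.delete(path);
                deleted++;
            } catch (NoNodeException e) {
                // Another thread already removed this node; the purge goal is met.
            }
        }
        return deleted;
    }

    /** Runs purge over the same list from several threads; returns total deletes. */
    static int purgeConcurrently(FakeStore store, List<String> paths, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger total = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers together to provoke the race
                    total.addAndGet(purge(store, paths));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return total.get();
    }
}
```

In this sketch no locking is added around the purge: each node is removed by exactly one thread, and the other threads simply skip it, which mirrors the trade-off described above of avoiding extra blocking.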