[ 
https://issues.apache.org/jira/browse/SOLR-16412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722938#comment-17722938
 ] 

David Smiley commented on SOLR-16412:
-------------------------------------

The two approaches aren't mutually exclusive, but I'd prefer to prevent 
multiple cleanups from happening concurrently.  The description implies this 
would require blocking, but it need not.  For example, if the data structure 
is already being cleaned up, the other thread can skip the cleanup and just 
proceed to add its data.  
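
To make the non-blocking idea concrete, here is a minimal sketch. It is not Solr's actual code: `GuardedSizeLimitedMap`, its fields, and the "drop half the entries" policy are all hypothetical, and an in-memory `ConcurrentHashMap` stands in for the ZooKeeper-backed storage. The only point it illustrates is that a `compareAndSet` guard lets one thread clean up while the others skip the cleanup without waiting.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical in-memory stand-in for a size-limited map; not Solr's class. */
class GuardedSizeLimitedMap {
  private final Map<String, String> entries = new ConcurrentHashMap<>();
  private final AtomicBoolean cleanupInProgress = new AtomicBoolean(false);
  /** Counts cleanup passes; exposed only so the behavior can be observed. */
  final AtomicInteger cleanupRuns = new AtomicInteger();
  private final int maxSize;

  GuardedSizeLimitedMap(int maxSize) {
    this.maxSize = maxSize;
  }

  void put(String key, String value) {
    // Only the thread that wins the compareAndSet performs the cleanup.
    // Any other thread sees the flag already set, skips the cleanup
    // without waiting, and goes straight to adding its data.
    if (entries.size() >= maxSize && cleanupInProgress.compareAndSet(false, true)) {
      try {
        cleanup();
      } finally {
        cleanupInProgress.set(false);
      }
    }
    entries.put(key, value);
  }

  /** Illustrative policy (an assumption): remove about half of the entries. */
  private void cleanup() {
    cleanupRuns.incrementAndGet();
    int toRemove = entries.size() / 2;
    Iterator<String> it = entries.keySet().iterator();
    while (toRemove > 0 && it.hasNext()) {
      it.next();
      it.remove();
      toRemove--;
    }
  }

  int size() {
    return entries.size();
  }
}
```

With this shape the map can briefly exceed its limit while a cleanup is running, which seems an acceptable trade-off for a best-effort size cap.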

> Race condition could trigger error on concurrent SizeLimitedDistributedMap 
> cleanup
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-16412
>                 URL: https://issues.apache.org/jira/browse/SOLR-16412
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 8.8, 9.1, main (10.0)
>            Reporter: Patson Luk
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>             Fix For: 9.1, main (10.0)
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> h2. Description
> The exception below is observed while updating the `completedMap` field in 
> `OverseerTaskProcessor`:
> {{o.a.s.c.OverseerTaskProcessor 
> :org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for 
> /overseer/collection-map-completed/mn-736f6c726d616e2d312d31383930383730393837313333303932353331}}
> {{at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)}}
> {{at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)}}
> {{at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)}}
> {{at 
> org.apache.solr.common.cloud.SolrZkClient.lambda$delete$1(SolrZkClient.java:264)}}
> {{at 
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:71)}}
> {{at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:263)}}
> {{at 
> org.apache.solr.cloud.SizeLimitedDistributedMap.put(SizeLimitedDistributedMap.java:76)}}
> {{at 
> org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:538)}}
> {{at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)}}
> {{at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
> {{at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}
> h2. Cause
> Based on the stack trace, `SizeLimitedDistributedMap` had reached its size 
> limit and attempted to clean up entries:
> [https://github.com/fullstorydev/lucene-solr/blob/75e89929eb360b513ee864aeb23a80c049747246/solr/core/src/java/org/apache/solr/cloud/SizeLimitedDistributedMap.java#L73-L80]
> However, the actual deletion failed with `NoNodeException`.
> This is likely caused by a race condition: multiple threads can enter the 
> same code block and try to delete the same list of children, so a slower 
> thread may attempt to delete a child node that no longer exists.
>  
> This condition can be reproduced by a unit test, which will be included in 
> the PR.
> h2. Solution
> Although we could enforce synchronization to prevent threads from purging the 
> same set of child nodes, adding extra blocking might not be desirable.
> Instead, it is probably safe for the purge operation to ignore 
> `KeeperException.NoNodeException` when the node is no longer there.
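
The approach proposed in the description (tolerating an already-deleted node during the purge) can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `TolerantPurge`, `FakeStore`, and the nested `NoNodeException` are hypothetical stand-ins so the example runs without ZooKeeper. In real code the exception would be `KeeperException.NoNodeException` and the delete would go through `SolrZkClient`.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class TolerantPurge {
  /** Hypothetical stand-in for KeeperException.NoNodeException. */
  static class NoNodeException extends Exception {
    NoNodeException(String path) {
      super("NoNode for " + path);
    }
  }

  /** Hypothetical in-memory stand-in for the ZooKeeper-backed store. */
  static class FakeStore {
    private final Set<String> nodes = ConcurrentHashMap.newKeySet();

    void create(String path) {
      nodes.add(path);
    }

    void delete(String path) throws NoNodeException {
      // Like a ZooKeeper delete, this fails if the node is already gone.
      if (!nodes.remove(path)) {
        throw new NoNodeException(path);
      }
    }

    boolean exists(String path) {
      return nodes.contains(path);
    }
  }

  /** Deletes each path and returns how many this caller actually removed. */
  static int purge(FakeStore store, List<String> paths) {
    int deleted = 0;
    for (String path : paths) {
      try {
        store.delete(path);
        deleted++;
      } catch (NoNodeException e) {
        // Another thread already removed this node. The purge only needs
        // the node to be gone, so this is treated as success.
      }
    }
    return deleted;
  }
}
```

Two threads purging the same list of children then both complete normally; each node is removed by whichever thread reaches it first.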



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
