[ https://issues.apache.org/jira/browse/SOLR-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539179#comment-16539179 ]

Varun Thacker commented on SOLR-10720:
--------------------------------------

{code:java}
} catch (SolrException e) {
  if (e.getCause() instanceof KeeperException.NoNodeException)  {
    // skip this collection because the collection's znode has been deleted
    // which can happen during aggressive collection removal, see SOLR-10720
  } else throw e;
}
{code}

In this code block, should we force-fetch the new ZK state? 

We saw the same issue with Solr 7.2.1, and only a restart fixed it. 

There was one scenario where creating a collection failed because of this error. 
I'm trying to capture logs from when this happened, but that's what motivated me 
to suggest refreshing the ZK state when we hit this error; a rough sketch of the 
idea follows.
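
To make that suggestion concrete, here is a minimal sketch, assuming 
{{ZkStateReader.forceUpdateCollection}} is an acceptable way to re-read a single 
collection's state from ZK. The helper class and method names are made up for 
illustration; this is not the code from the attached patch:
{code:java}
import org.apache.solr.common.SolrException;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.zookeeper.KeeperException;

// Hypothetical helper, not part of any patch: when a read fails because the
// collection's znode is gone, force ZkStateReader to re-read that collection
// so the cached cluster state stops advertising the deleted collection.
public class StaleCollectionRefresher {

  public static void refreshIfZnodeMissing(ZkStateReader zkStateReader,
                                           String collection,
                                           SolrException e)
      throws KeeperException, InterruptedException {
    if (e.getCause() instanceof KeeperException.NoNodeException) {
      // Re-fetch this collection's state from ZK; afterwards the cached
      // ClusterState should no longer contain the removed collection.
      zkStateReader.forceUpdateCollection(collection);
    } else {
      throw e; // unrelated failure, propagate as before
    }
  }
}
{code}
The idea is that after such a refresh, {{zkStateReader.getClusterState()}} would 
no longer return the deleted collection, so subsequent reads wouldn't keep 
hitting the same NoNodeException until a restart.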

> Aggressive removal of a collection breaks cluster state
> -------------------------------------------------------
>
>                 Key: SOLR-10720
>                 URL: https://issues.apache.org/jira/browse/SOLR-10720
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.5.1
>            Reporter: Alexey Serba
>            Assignee: Shalin Shekhar Mangar
>            Priority: Major
>             Fix For: 7.3, master (8.0)
>
>         Attachments: SOLR-10720.patch
>
>
> We are periodically seeing a tricky concurrency bug in SolrCloud that starts 
> with a `Could not fully remove collection: my_collection` exception:
> {noformat}
> 2017-05-17T14:47:50,153 - ERROR 
> [OverseerThreadFactory-6-thread-5:SolrException@159] - {} - Collection: 
> my_collection operation: delete failed:org.apache.solr.common.SolrException: 
> Could not fully remove collection: my_collection
>         at 
> org.apache.solr.cloud.DeleteCollectionCmd.call(DeleteCollectionCmd.java:106)
>         at 
> org.apache.solr.cloud.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:224)
>         at 
> org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:463)
> {noformat}
> After that, all SolrCloud operations that involve reading the cluster state 
> fail with
> {noformat}
> org.apache.solr.common.SolrException: Error loading config name for 
> collection my_collection
>     at 
> org.apache.solr.common.cloud.ZkStateReader.readConfigName(ZkStateReader.java:198)
>     at 
> org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:141)
> ...
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /collections/my_collection
> ...
> {noformat}
> See full 
> [stacktraces|https://gist.github.com/serba/9b7932f005f34f6cd9a511e226c6f0c6]
> As a result, SolrCloud becomes completely broken. We are seeing this with 
> 6.5.1, but I think we’ve seen it with older versions too.
> From looking at the code, it appears to be a combination of two factors:
> * Forcefully removing the collection's znode in the finally block in 
> [DeleteCollectionCmd|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L115],
>  which was introduced in SOLR-5135. Note that this leaves the cached cluster 
> state out of sync with the state in ZK, i.e. 
> {{zkStateReader.getClusterState()}} still contains the collection (see the code 
> [here|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.5.1/solr/core/src/java/org/apache/solr/cloud/DeleteCollectionCmd.java#L98])
>  whereas the {{/collections/<collection_id>}} znode in ZK has already been removed.
> * The cluster status operation not only returns the cached version, but it 
> also reads each collection's config name from its 
> {{/collections/<collection_id>}} znode, and that znode has been forcefully 
> removed. The code to read the config name for every collection directly from 
> ZK was introduced in SOLR-7636. Aren't there performance implications of 
> reading N znodes (1 per collection) on every {{getClusterStatus}} call? 
> I'm not sure what the proper fix should be:
> * Should we just catch {{KeeperException$NoNodeException}} in 
> {{getClusterStatus}} and treat such a collection as removed? That looks like 
> the easiest / least invasive fix.
> * Should we stop reading the config name from the collection znode and get it 
> from a cache somehow?
> * Should we not try to delete the collection's data from ZK if the delete 
> operation failed?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
