[jira] [Commented] (SOLR-9836) Add more graceful recovery steps when failing to create SolrCore

Mike Drob (JIRA) Wed, 07 Dec 2016 15:18:10 -0800

    [ https://issues.apache.org/jira/browse/SOLR-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730254#comment-15730254 ]


Mike Drob commented on SOLR-9836:
---------------------------------

bq. I'm not sure that is the right exception to catch - very brittle. We should 
probably be mostly looking for CorruptIndexException and if that doesn't 
cover a case at the Lucene level, look at improving that there. Even if the 
case of a 0 byte segments file with nothing to roll back on throws an 
EOFException today, it may not tomorrow. I think that is the goal of the 
CorruptIndexException - you can actually have a little more than momentary 
confidence that your code is not treating exceptions one way while things 
change underneath you over time.
I could add a check somewhere along the chain that would turn an {{EOFException}} into a 
{{CorruptIndexException}}. However, I'm not confident enough in the Lucene internals to 
know whether this would eventually lead to false positives somewhere. It would probably 
look like:
{code:title=SegmentInfos.java:276}
     long generation = generationFromSegmentsFileName(segmentFileName);
     //System.out.println(Thread.currentThread() + ": SegmentInfos.readCommit " + segmentFileName);
+    ChecksumIndexInput saved = null;
     try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
+      saved = input;
       return readCommit(directory, input, generation);
+    } catch (EOFException e) {
+      throw new CorruptIndexException("Unexpected end of file while reading index.", saved, e);
     }
   }
{code}

But the method javadoc worries me: {{* Read a particular segmentFileName.  Note 
that this may throw an IOException if a commit is in process.}}
Under what circumstances would this throw an IOException? Throwing 
{{CorruptIndexException}} at random during normal operation would be bad news.
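
For illustration only, the same translation pattern can be sketched outside Lucene. This is a minimal sketch, not the proposed patch: {{CorruptFileException}} and {{readHeader}} are hypothetical stand-ins (for {{CorruptIndexException}} and the segments-file read), and the 4-byte header is an assumption made purely so that an empty file triggers an {{EOFException}}. It does not address the in-progress-commit concern above; a reader racing with a writer could still see a short file and report corruption.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for Lucene's CorruptIndexException, so the sketch
// compiles without a Lucene dependency.
class CorruptFileException extends IOException {
  CorruptFileException(String message, Throwable cause) {
    super(message, cause);
  }
}

class EofTranslationSketch {
  // Hypothetical reader: expects a 4-byte header, so an empty or truncated
  // file makes DataInputStream.readInt() throw EOFException.
  static int readHeader(Path file) throws IOException {
    try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
      return in.readInt();
    } catch (EOFException e) {
      // Translate the low-level EOF into a corruption signal and keep the
      // original exception as the cause.
      throw new CorruptFileException("Unexpected end of file while reading " + file, e);
    }
  }

  public static void main(String[] args) throws IOException {
    // A freshly created temp file is zero-length, like the segments_n files
    // described in the issue.
    Path empty = Files.createTempFile("segments_", ".tmp");
    try {
      readHeader(empty);
    } catch (CorruptFileException e) {
      System.out.println("corrupt, caused by EOF: " + (e.getCause() instanceof EOFException));
    } finally {
      Files.delete(empty);
    }
  }
}
```

Callers that already handle the corruption exception then need no separate {{EOFException}} branch, which is the point of doing the translation at the lowest level.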

> Add more graceful recovery steps when failing to create SolrCore
> ----------------------------------------------------------------
>
>                 Key: SOLR-9836
>                 URL: https://issues.apache.org/jira/browse/SOLR-9836
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Mike Drob
>         Attachments: SOLR-9836.patch
>
>
> I have seen several cases where there is a zero-length segments_n file. We 
> haven't identified the root cause of these issues (possibly a poorly timed 
> crash during replication?) but if there is another node available then Solr 
> should be able to recover from this situation. Currently, we log and give up 
> on loading that core, leaving the user to manually intervene.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

