[ https://issues.apache.org/jira/browse/SOLR-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730254#comment-15730254 ]
Mike Drob commented on SOLR-9836:
---------------------------------
bq. I'm not sure that is the right exception to catch - very brittle. We should
probably be mostly looking for CorruptIndexException and if that doesn't
cover a case at the Lucene level, look at improving that there. Even if the
case of a 0 byte segments file with nothing to roll back on throws an
EOFException today, it may not tomorrow. I think that is the goal of the
CorruptIndexException - you can actually have a little more than momentary
confidence that your code is not treating exceptions one way while things
change underneath you over time.
I could add a check somewhere along the chain that would turn an
{{EOFException}} into a {{CorruptIndexException}}. However, I'm not confident
enough in the Lucene internals to know whether this would eventually lead to
false positives somewhere. It would probably look like:
{code:title=SegmentInfos.java:276}
     long generation = generationFromSegmentsFileName(segmentFileName);
     //System.out.println(Thread.currentThread() + ": SegmentInfos.readCommit " + segmentFileName);
+    ChecksumIndexInput saved = null;
     try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
+      saved = input;
       return readCommit(directory, input, generation);
+    } catch (EOFException e) {
+      throw new CorruptIndexException("Unexpected end of file while reading index.", saved, e);
     }
   }
{code}
But the method javadoc worries me: {{Read a particular segmentFileName. Note
that this may throw an IOException if a commit is in process.}}
Under what circumstances would this throw an IOException? Spuriously throwing
{{CorruptIndexException}} during normal operation would be bad news.
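For anyone who wants to try the translation idea outside of Lucene, here is a minimal, JDK-only sketch of the same pattern. {{CorruptFileException}} and {{HeaderReader}} are hypothetical stand-ins invented for this illustration (they are not Lucene classes), and the 4-byte header is an assumption, not the real segments file format.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for Lucene's CorruptIndexException, so that the
// sketch compiles without a Lucene dependency.
class CorruptFileException extends IOException {
  CorruptFileException(String message, Path resource, Throwable cause) {
    super(message + " (resource=" + resource + ")", cause);
  }
}

// Hypothetical reader: the 4-byte header is an assumption for illustration.
class HeaderReader {
  // Reads a 4-byte big-endian header. A zero-length or truncated file makes
  // readInt() throw EOFException, which is translated into the more specific
  // CorruptFileException. Every other IOException propagates unchanged.
  static int readHeader(Path file) throws IOException {
    try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
      return in.readInt();
    } catch (EOFException e) {
      throw new CorruptFileException("Unexpected end of file while reading header", file, e);
    }
  }
}
```

Only {{EOFException}} is translated here. A missing file, for example, still surfaces as a plain {{IOException}} subclass. The sketch does not answer whether an in-progress commit can produce an {{EOFException}} in Lucene itself.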
> Add more graceful recovery steps when failing to create SolrCore
> ----------------------------------------------------------------
>
> Key: SOLR-9836
> URL: https://issues.apache.org/jira/browse/SOLR-9836
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: Mike Drob
> Attachments: SOLR-9836.patch
>
>
> I have seen several cases where there is a zero-length segments_n file. We
> haven't identified the root cause of these issues (possibly a poorly timed
> crash during replication?) but if there is another node available then Solr
> should be able to recover from this situation. Currently, we log and give up
> on loading that core, leaving the user to manually intervene.