[ https://issues.apache.org/jira/browse/SOLR-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550417#comment-14550417 ]

Shawn Heisey commented on SOLR-7511:
------------------------------------

There are people who run a single Solr instance in production, even though we 
strongly recommend high availability practices.  I think some of those people 
are also using their single Solr instance as a primary data store.

Rolling over to a new index directory definitely seems smarter -- keep the 
corrupt index around in case the user wants to attempt data recovery.  If we 
just purge it, somebody is going to complain loudly that we deleted all their 
data.
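
To make the roll-over idea concrete, here is a minimal, hypothetical sketch; the 
class name, method name, and the "index.corrupt.<timestamp>" convention are 
illustrative only and do not come from Solr's code base. The point is just that 
the unreadable directory is moved aside rather than deleted:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexRollover {
    // Move the unreadable index aside under a timestamped name so the data
    // survives for manual recovery, then give the core a fresh, empty index
    // directory to replicate or re-index into.
    static Path rollOverCorruptIndex(Path dataDir) throws IOException {
        Path index = dataDir.resolve("index");
        Path quarantined = dataDir.resolve("index.corrupt." + System.currentTimeMillis());
        Files.move(index, quarantined);   // keep the corrupt segments on disk
        Files.createDirectory(index);     // empty directory for the new index
        return quarantined;               // caller can log where the old data went
    }
}
{code}

The replica could then come back empty and replicate from the leader, while the 
quarantined directory stays available for CheckIndex or manual salvage.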

> Unable to open searcher when chaosmonkey is actively restarting solr and data nodes
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-7511
>                 URL: https://issues.apache.org/jira/browse/SOLR-7511
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.3
>            Reporter: Hrishikesh Gadre
>
> I have a working chaos-monkey setup which periodically kills (and restarts) 
> Solr and data nodes in a round-robin fashion. I wrote a simple Solr client to 
> periodically index and query a bunch of documents. After the test has been 
> running for some time, Solr returns an incorrect number of documents. In the 
> background, I see the following errors:
> org.apache.solr.common.SolrException: Error opening new searcher
>         at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1577)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1689)
>         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:856)
>         ... 8 more
> Caused by: java.io.EOFException: read past EOF
>         at org.apache.solr.store.blockcache.CustomBufferedIndexInput.refill(CustomBufferedIndexInput.java:186)
>         at org.apache.solr.store.blockcache.CustomBufferedIndexInput.readByte(CustomBufferedIndexInput.java:46)
>         at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
>         at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
>         at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:134)
>         at org.apache.lucene.codecs.lucene46.Lucene46SegmentInfoReader.read(Lucene46SegmentInfoReader.java:54)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:358)
>         at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
>         at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:792)
>         at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
>         at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
> The issue here is that the index for one of the replicas is corrupt (verified 
> using the Lucene CheckIndex tool; see the sketch below this report). Hence 
> Solr is not able to load the core on that particular instance.
> Interestingly, when the other sane replica comes online, it tries to peer-sync 
> with this failing replica, gets an error, and also moves into the recovering 
> state. As a result this particular shard is completely unavailable for 
> read/write requests. Here are sample log entries from the sane replica:
> Error opening new searcher,trace=org.apache.solr.common.SolrException: SolrCore 'customers_shard1_replica1' is not available due to init failure: Error opening new searcher
>         at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:211)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>         at org.apache.solr.servlet.SolrHadoopAuthenticationFilter$2.doFilter(SolrHadoopAuthenticationFilter.java:288)
>         at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>         at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:277)
> 2015-05-07 12:41:49,954 INFO org.apache.solr.update.PeerSync: PeerSync: core=customers_shard1_replica2 url=http://ssl-systests-3.ent.cloudera.com:8983/solr DONE. sync failed
> 2015-05-07 12:41:49,954 INFO org.apache.solr.cloud.SyncStrategy: Leader's attempt to sync with shard failed, moving to the next candidate
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ShardLeaderElectionContext: There may be a better leader candidate than us - going back into recovery
> 2015-05-07 12:41:50,007 INFO org.apache.solr.cloud.ElectionContext: canceling election /collections/customers/leader_elect/shard1/election/93773657844879326-core_node6-n_0000001722
> 2015-05-07 12:41:50,020 INFO org.apache.solr.update.DefaultSolrCoreState: Running recovery - first canceling any ongoing recovery
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: The last recovery attempt started 2685ms ago.
> 2015-05-07 12:41:50,020 INFO org.apache.solr.cloud.ActionThrottle: Throttling recovery attempts - waiting for 7314ms
> I am able to reproduce this problem consistently.
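
For reference, a rough sketch of how the CheckIndex verification mentioned in 
the report above can be run against a replica's index directory. It targets the 
Lucene 4.10.x API named in the report and assumes the index is available on a 
local filesystem path (the replica in the report actually sits behind the HDFS 
block cache); the class name and path below are placeholders, not taken from 
the report.

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Illustrative only: open an index directory and report whether
// CheckIndex considers it clean.
public class VerifyReplicaIndex {
    public static void main(String[] args) throws IOException {
        // Placeholder path; point this at the failing replica's data/index directory.
        File indexDir = new File(args.length > 0
                ? args[0]
                : "/path/to/customers_shard1_replica1/data/index");
        try (Directory dir = FSDirectory.open(indexDir)) {
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);             // per-segment diagnostics
            CheckIndex.Status status = checker.checkIndex();
            System.out.println(status.clean ? "index is clean" : "index is corrupt");
        }
    }
}
{code}

The same check can also be run from the command line through CheckIndex's main 
method, e.g. java -cp lucene-core-*.jar org.apache.lucene.index.CheckIndex <indexDir>.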


