[
https://issues.apache.org/jira/browse/SOLR-10806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039047#comment-16039047
]
Shawn Heisey commented on SOLR-10806:
-------------------------------------
This looks like a very low-level Lucene problem, possibly caused by building
an index with one schema, then changing the schema and using it with the
existing index. You say this happens during collection creation, so I wonder
whether you have core/index directories left over from a previous version of
the collection. In that case Solr would be trying to recreate a core in a
directory that already exists and contains an index, and finding that the
existing index isn't compatible with the new schema.

Generally speaking, most schema changes require a reindex, and sometimes the
entire index must be completely wiped out before starting the reindex, because
of problems like this.
https://wiki.apache.org/solr/HowToReindex
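
As a rough illustration of the "wipe, then reindex" step, here is a minimal
Python sketch that only builds a delete-all request for Solr's JSON update
handler and does not send it. The base URL and the helper name are
placeholders I made up; the collection name is taken from the core names in
the quoted log. Whether a delete-by-query is enough for a given schema change
is not something this settles. Removing the index directory while Solr is
stopped is the more thorough option.

```python
import json
from typing import Dict, Tuple
from urllib.parse import quote


def build_delete_all_request(base_url: str, collection: str) -> Tuple[str, bytes, Dict[str, str]]:
    """Build (url, body, headers) for a delete-all-documents update.

    Hypothetical helper: it only assembles the request so it can be inspected
    before anything is sent to a real cluster.
    """
    # commit=true makes the deletion visible as soon as the request completes.
    url = f"{base_url.rstrip('/')}/{quote(collection, safe='')}/update?commit=true"
    # Delete-by-query matching every document in the collection.
    body = json.dumps({"delete": {"query": "*:*"}}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers


if __name__ == "__main__":
    # Placeholder host and port; adjust for the actual cluster.
    url, body, headers = build_delete_all_request(
        "http://localhost:8983/solr", "node-instances"
    )
    print(url)
    print(body.decode("utf-8"))
```

The result could be passed to urllib.request.Request(url, data=body,
headers=headers) once the URL has been checked by hand.
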
Low-level Lucene problems are very difficult for Solr to handle cleanly.
You're right that this shouldn't cause everything to grind to a halt, but a
reasonable outcome may be hard to achieve when the failure is this deep in
Lucene. We should try; I'm just warning you in advance that it might not be
easy.
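
For what it's worth, the number in the quoted exception is consistent with an
int-encoded term being read as a long. The sketch here is Python, not Lucene
code. The two prefix constants are what I recall from LegacyNumericUtils (0x20
for long terms, 0x60 for int terms) and should be treated as assumptions to
verify against the 6.3 source. The function names are made up for
illustration.

```python
# Leading-byte offsets as recalled from Lucene's LegacyNumericUtils.
# Assumption: verify these against the Lucene version in use.
SHIFT_START_LONG = 0x20
SHIFT_START_INT = 0x60


def first_byte_for_int(shift: int) -> int:
    """Leading byte of a prefix-coded 32-bit term at the given shift."""
    if not 0 <= shift <= 31:
        raise ValueError("int shift must be in 0..31")
    return SHIFT_START_INT + shift


def long_shift_from_first_byte(first_byte: int) -> int:
    """Mimic the range check a long decoder applies to the leading byte."""
    shift = first_byte - SHIFT_START_LONG
    if shift < 0 or shift > 63:
        raise ValueError(
            f"Invalid shift value ({shift}) in prefixCoded bytes "
            "(is encoded value really an INT?)"
        )
    return shift


def explain_mismatch() -> str:
    """Decode a full-precision int term as a long and return the error text."""
    try:
        long_shift_from_first_byte(first_byte_for_int(0))
    except ValueError as err:
        return str(err)
    return "no error"


if __name__ == "__main__":
    # 0x60 - 0x20 = 0x40, i.e. 64, which is outside the 0..63 range for longs.
    print(explain_mismatch())
```

The trace goes through VersionInfo.getMaxVersionFromIndex, so the field worth
checking first is probably the version field: if the terms on disk were
written with an int type and the current schema declares a long, this is the
failure one would expect.
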
> Solr Replica goes down with NumberFormatException: Invalid shift value (64)
> in prefixCoded bytes (is encoded value really an INT?)
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-10806
> URL: https://issues.apache.org/jira/browse/SOLR-10806
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 6.3
> Reporter: Sachin Goyal
>
> Our Solr nodes go down within 20-30 minutes of indexing.
> The load rate does not seem to be too high, because the exception in the
> logs points to a data problem:
> {color:darkred}
> INFO - 2017-06-02 23:21:19.094; org.apache.solr.core.SolrCore; \[node-instances_shard2_replica3\] Registered new searcher Searcher@6740879c\[node-instances_shard2_replica3\] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_ne(6.3.0):C200591/8616:delGen=20) Uninverting(_wx(6.3.0):C72132/697:delGen=5) Uninverting(_y0(6.3.0):c5798/27:delGen=3) Uninverting(_yv(6.3.0):c10935/827:delGen=2) Uninverting(_z4(6.3.0):C4163/2277:delGen=1)))}
> ERROR - 2017-06-02 23:21:19.105; org.apache.solr.core.CoreContainer; Error waiting for SolrCore to be created
> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core \[node-instances_shard2_replica3\]
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.solr.core.CoreContainer.lambda$load$1(CoreContainer.java:526)
> at org.apache.solr.core.CoreContainer$$Lambda$38/199449817.run(Unknown Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$9/1611272577.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.common.SolrException: Unable to create core \[node-instances_shard2_replica3\]
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:855)
> at org.apache.solr.core.CoreContainer.lambda$load$0(CoreContainer.java:498)
> at org.apache.solr.core.CoreContainer$$Lambda$37/1402433372.call(Unknown Source)
> ... 6 more
> Caused by: java.lang.NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is encoded value really an INT?)
> at org.apache.lucene.util.LegacyNumericUtils.getPrefixCodedLongShift(LegacyNumericUtils.java:163)
> at org.apache.lucene.util.LegacyNumericUtils$1.accept(LegacyNumericUtils.java:392)
> at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:232)
> at org.apache.lucene.index.Terms.getMax(Terms.java:169)
> at org.apache.lucene.util.LegacyNumericUtils.getMaxLong(LegacyNumericUtils.java:504)
> at org.apache.solr.update.VersionInfo.getMaxVersionFromIndex(VersionInfo.java:233)
> at org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1584)
> at org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1610)
> at org.apache.solr.core.SolrCore.seedVersionBuckets(SolrCore.java:949)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:931)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:776)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:842)
> ... 8 more
> {color}
> It does not seem right that Solr Node itself should go down for such a
> problem.
> # Error waiting for SolrCore to be created
> java.util.concurrent.ExecutionException:
> org.apache.solr.common.SolrException: Unable to create core
> # Unable to create core
> # NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is
> encoded value really an INT?)
> That is, core creation fails because of some confusion between long and
> integer.
> If there is a data issue, Solr should report it with an exception during
> ingestion.
> \\
> \\
> *UPDATE*:
> Another issue I see with the above problem is that the Solr cluster is
> completely inaccessible.
> Solr-UI is also not coming up. I restarted the Solr servers and they refuse
> to recover.
> I am not even able to delete the collections and create them afresh.
> It seems the only way out is to do an *rm -rf* and re-install.
> Note that it is not related to the network, as I can ssh to the Solr machines
> and send messages to other Solr machines using nc.
> \\
> \\
> *UPDATE 2*:
> I had a 24-node cluster with 2 collections.
> Each collection used 6 nodes in a 2-shard, 3-replica configuration.
> So 12 of the 24 nodes were in use.
> The remaining 12 nodes had Solr running with the same ZooKeeper but no
> collections/cores.
> After the above errors began to happen, the Solr-UI of all 24 nodes became
> unresponsive!
> So I tried the delete-collection API from the command line - no response.
> Ultimately I ran the delete-collection from the command line in a loop and it
> deleted a part of the collection.
> Then I had to manually delete the *<coreName>/data/index/write.lock* file on
> some nodes to purge those bad collections.
> It's been a few hours since then. There are no collections, and a few nodes
> are still unresponsive, with the following messages in the logs:
> {color:brown}
> INFO - 2017-06-03 06:40:51.308; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> INFO - 2017-06-03 06:40:51.408; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> INFO - 2017-06-03 06:40:51.508; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> INFO - 2017-06-03 06:40:51.608; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> {color}
> It looks like a serious stability problem to me.