[
https://issues.apache.org/jira/browse/SOLR-10806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041036#comment-16041036
]
Sachin Goyal commented on SOLR-10806:
-------------------------------------
There were no old re-indexes and it was a fresh Solr cluster. We were using a
data-driven schema, and our theory is that one of the shards guessed some field
to be a long while the other shard guessed the *same* field to be an integer.
If that is true, then it's a pretty bad problem IMO, and one that is difficult
to reproduce (because the shards would have to *simultaneously* guess different
types for the same field). It is also a problem that may not show up in
several test runs but may show up directly in production, because it depends on
race conditions between the shards.
And it still does not answer why the Solr UI is becoming unresponsive. Why is
the thread serving the Solr UI getting blocked by a low-level problem?
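
For what it is worth, the value 64 in the exception is what one would expect
if a term written with the INT prefix encoding were read back as a LONG. Below
is a minimal sketch of that arithmetic. It is a simplified stand-in, not the
Lucene source: the class and method names are made up for illustration, and the
two constants are assumed to match SHIFT_START_LONG and SHIFT_START_INT in
LegacyNumericUtils. It only shows how the number 64 can arise; it does not show
which field was affected or how the mismatch got into the index.

```java
public class PrefixShiftDemo {
    // Assumed values of LegacyNumericUtils.SHIFT_START_LONG / SHIFT_START_INT.
    static final byte SHIFT_START_LONG = 0x20;
    static final byte SHIFT_START_INT = 0x60;

    // Hypothetical stand-in for getPrefixCodedLongShift: the first byte of a
    // prefix-coded term carries (start marker + shift).
    public static int longShiftOf(byte firstByte) {
        int shift = firstByte - SHIFT_START_LONG;
        if (shift > 63 || shift < 0) {
            throw new NumberFormatException("Invalid shift value (" + shift
                + ") in prefixCoded bytes (is encoded value really an INT?)");
        }
        return shift;
    }

    public static void main(String[] args) {
        // First byte of a full-precision (shift 0) LONG term: decodes fine.
        System.out.println(longShiftOf(SHIFT_START_LONG));

        // First byte of a full-precision (shift 0) INT term, read as a LONG:
        // 0x60 - 0x20 = 64, which is outside the valid range 0..63.
        try {
            longShiftOf(SHIFT_START_INT);
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the assumed constants are right, any full-precision INT term that reaches
the LONG decoder fails this way, regardless of the numeric value it holds.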
> Solr Replica goes down with NumberFormatException: Invalid shift value (64)
> in prefixCoded bytes (is encoded value really an INT?)
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-10806
> URL: https://issues.apache.org/jira/browse/SOLR-10806
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Affects Versions: 6.3
> Reporter: Sachin Goyal
>
> Our Solr nodes go down within 20-30 minutes of indexing.
> It does not seem that load-rate is too high because the exception in the logs
> is pointing to a data problem:
> {color:darkred}
> INFO - 2017-06-02 23:21:19.094; org.apache.solr.core.SolrCore;
> \[node-instances_shard2_replica3\] Registered new searcher
> Searcher@6740879c\[node-instances_shard2_replica3\]
> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_ne(6.3.0):C200591/8616:delGen=20)
> Uninverting(_wx(6.3.0):C72132/697:delGen=5)
> Uninverting(_y0(6.3.0):c5798/27:delGen=3)
> Uninverting(_yv(6.3.0):c10935/827:delGen=2)
> Uninverting(_z4(6.3.0):C4163/2277:delGen=1)))}
> ERROR - 2017-06-02 23:21:19.105; org.apache.solr.core.CoreContainer; Error waiting for SolrCore to be created
> java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core \[node-instances_shard2_replica3\]
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.solr.core.CoreContainer.lambda$load$1(CoreContainer.java:526)
> at org.apache.solr.core.CoreContainer$$Lambda$38/199449817.run(Unknown Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$9/1611272577.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.solr.common.SolrException: Unable to create core \[node-instances_shard2_replica3\]
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:855)
> at org.apache.solr.core.CoreContainer.lambda$load$0(CoreContainer.java:498)
> at org.apache.solr.core.CoreContainer$$Lambda$37/1402433372.call(Unknown Source)
> ... 6 more
> Caused by: java.lang.NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is encoded value really an INT?)
> at org.apache.lucene.util.LegacyNumericUtils.getPrefixCodedLongShift(LegacyNumericUtils.java:163)
> at org.apache.lucene.util.LegacyNumericUtils$1.accept(LegacyNumericUtils.java:392)
> at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:232)
> at org.apache.lucene.index.Terms.getMax(Terms.java:169)
> at org.apache.lucene.util.LegacyNumericUtils.getMaxLong(LegacyNumericUtils.java:504)
> at org.apache.solr.update.VersionInfo.getMaxVersionFromIndex(VersionInfo.java:233)
> at org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1584)
> at org.apache.solr.update.UpdateLog.seedBucketsWithHighestVersion(UpdateLog.java:1610)
> at org.apache.solr.core.SolrCore.seedVersionBuckets(SolrCore.java:949)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:931)
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:776)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:842)
> ... 8 more
> {color}
> It does not seem right that the Solr node itself should go down for such a
> problem.
> # Error waiting for SolrCore to be created
> java.util.concurrent.ExecutionException:
> org.apache.solr.common.SolrException: Unable to create core
> # Unable to create core
> # NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is
> encoded value really an INT?)
> i.e. core creation fails because there was some confusion between long and
> integer.
> If there is a data issue, then Solr should report it with an exception
> during ingestion.
> \\
> \\
> *UPDATE*:
> Another issue I see with the above problem is that the Solr cluster is
> completely inaccessible.
> The Solr UI is also not coming up. I restarted the Solr servers and they
> refuse to recover.
> I am not even able to delete the collections and create them afresh.
> It seems the only way out is to do an *rm -rf* and re-install.
> Note that it is not related to the network, as I can ssh to the Solr machines
> and send messages to other Solr machines using nc.
> \\
> \\
> *UPDATE 2*:
> I had a 24-node cluster with 2 collections.
> Each collection used 6 nodes in a 2-shard, 3-replica configuration.
> So 12 of the 24 nodes were in use.
> The remaining 12 nodes had Solr running with the same ZooKeeper but no
> collections/cores.
> After the above errors began to happen, the Solr UI on all 24 nodes became
> unresponsive!
> So I tried the delete-collection API from the command line - no response.
> Ultimately I ran delete-collection from the command line in a loop and it
> deleted part of the collection.
> Then I had to manually delete the *<coreName>/data/index/write.lock* file on
> some nodes to purge those bad collections.
> It's been a few hours since then. There are no collections, and a few nodes
> are still unresponsive with the following messages in the logs:
> {color:brown}
> INFO - 2017-06-03 06:40:51.308; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> INFO - 2017-06-03 06:40:51.408; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> INFO - 2017-06-03 06:40:51.508; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> INFO - 2017-06-03 06:40:51.608; org.apache.solr.core.SolrCore; Core
> sync-status_shard1_replica2 is not yet closed, waiting 100 ms before checking
> again.
> {color}
> It looks like a serious stability problem to me.