[
https://issues.apache.org/jira/browse/SOLR-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712222#comment-14712222
]
Yonik Seeley commented on SOLR-7836:
------------------------------------
I've been running ChaosMonkeySafeLeaderTest for about 3 days with my test
script that also searches for corrupt indexes or assertion failures even when
the test still passes.
Current trunk (as of last week): 9 corrupt indexes
Patched trunk: 14 corrupt indexes and 2 test failures (inconsistent shards)
The corrupt indexes *may* not be a problem; I don't really know. We kill off
servers, perhaps during replication? That could plausibly produce corrupt
indexes, but I don't know if that's the actual scenario. An increased
incidence of corrupt indexes doesn't necessarily point to a problem either.
But inconsistent shards vs. not... does seem like a problem if it holds.
I've reviewed the locking code again, and it looks solid, so I'm not sure
what's going on.
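For context, the writerFree/pauseWriter interplay in DefaultSolrCoreState boils down to a guarded-wait pattern: callers lease the IndexWriter while writerFree is true, and a closer sets pauseWriter, waits for the lease to drain, then swaps the writer. Below is a minimal sketch of that pattern (field and method names are simplified illustrations, not Solr's actual code), showing where a missed notifyAll would hang the closer:

```java
// Minimal sketch of the guarded-wait pattern that writerFree/pauseWriter
// appear to implement (a simplification for illustration, not Solr's code).
public class CoreStateSketch {
    private final Object writerPauseLock = new Object();
    private boolean pauseWriter = false; // set while the writer is being replaced
    private boolean writerFree = true;   // false while a caller holds the writer

    // Borrow the writer: block while it is held or a replacement is in progress.
    public void lease() throws InterruptedException {
        synchronized (writerPauseLock) {
            while (pauseWriter || !writerFree) {
                writerPauseLock.wait();
            }
            writerFree = false;
        }
    }

    // Return the writer; the notifyAll is what wakes a waiting closer --
    // dropping it (or signalling the wrong monitor) deadlocks replaceWriter().
    public void release() {
        synchronized (writerPauseLock) {
            writerFree = true;
            writerPauseLock.notifyAll();
        }
    }

    // Pause new leases, wait for the current holder, swap the writer, resume.
    public void replaceWriter() throws InterruptedException {
        synchronized (writerPauseLock) {
            pauseWriter = true;
            while (!writerFree) {
                writerPauseLock.wait();
            }
            // ... close and reopen the IndexWriter here ...
            pauseWriter = false;
            writerPauseLock.notifyAll();
        }
    }
}
```

The hazard in this shape of code is ordering: both flags must be read and written under the same monitor, and every transition that could unblock a waiter must notify, or a closer and a lease-holder can each end up waiting on the other.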
Here's a typical corrupt index trace:
{code}
2> 21946 WARN (RecoveryThread-collection1) [n:127.0.0.1:51815_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.h.IndexFetcher Could not retrieve checksum from file.
2> org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1698720114 vs expected footer=-1071082520 (resource=MMapIndexInput(path="/opt/code/lusolr_clean2/solr/build/solr-core/test/J0/temp/solr.cloud.ChaosMonkeySafeLeaderTest_B7DC9C42462BF20D-001/shard-2-001/cores/collection1/data/index/_0.fdt"))
2> at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:416)
2> at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:401)
2> at org.apache.solr.handler.IndexFetcher.compareFile(IndexFetcher.java:876)
2> at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:839)
2> at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:437)
2> at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:265)
2> at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:382)
2> at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
2> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
2> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
{code}
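One detail worth noting in the trace: the "expected footer=-1071082520" is Lucene's footer magic, which is the bitwise complement of its codec header magic (CODEC_MAGIC = 0x3fd76c17 in CodecUtil). The check below reproduces just that constant relationship (a sketch of the comparison, not Lucene's full footer validation, which also covers the algorithm ID and checksum):

```java
// The "expected footer" in the trace is CodecUtil.FOOTER_MAGIC, the bitwise
// complement of Lucene's codec header magic. An "actual footer" that differs
// means the last bytes of the file are not a valid codec footer at all --
// consistent with the file being truncated or partially written.
public class FooterMagic {
    static final int CODEC_MAGIC = 0x3fd76c17;        // CodecUtil.CODEC_MAGIC
    static final int FOOTER_MAGIC = ~CODEC_MAGIC;     // CodecUtil.FOOTER_MAGIC

    public static void main(String[] args) {
        System.out.println(FOOTER_MAGIC); // prints -1071082520, matching the trace
    }
}
```

So this failure mode points at an incomplete _0.fdt rather than bit rot inside an otherwise well-formed file, which fits the kill-during-replication theory above.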
> Possible deadlock when closing refcounted index writers.
> --------------------------------------------------------
>
> Key: SOLR-7836
> URL: https://issues.apache.org/jira/browse/SOLR-7836
> Project: Solr
> Issue Type: Bug
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Fix For: Trunk, 5.4
>
> Attachments: SOLR-7836-reorg.patch, SOLR-7836-synch.patch,
> SOLR-7836.patch, SOLR-7836.patch, SOLR-7836.patch, SOLR-7836.patch,
> deadlock_3.res.zip, deadlock_5_pass_iw.res.zip, deadlock_test
>
>
> Preliminary patch for what looks like a possible race condition between
> writerFree and pauseWriter in DefaultSolrCoreState.
> Looking for comments and/or why I'm completely missing the boat.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)