[
https://issues.apache.org/jira/browse/SOLR-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004472#comment-14004472
]
Shalin Shekhar Mangar commented on SOLR-5309:
---------------------------------------------
I am looking at these failures again today. Yeah, it's been that busy around
here :(
I implemented a RateLimitedDirectoryFactory for Solr with a very small limit
and forced ShardSplitTest to use it always. This helped reproduce the issue for
me. I have finally managed to track down the root cause. It always perplexed me
that the difference between expected and actual doc counts was almost always 1.
Whenever we add/delete documents during shard splitting, we synchronously
forward the request to the appropriate sub-shard. For add requests, a single
sub-shard is selected, but for delete-by-id we aren't selecting a single
sub-shard. Instead we are forwarding the delete-by-id to all sub-shards. This
works out fine and doesn't cause any damage in practice because the id exists
on only one sub-shard. However, when one sub-shard (the right one) accepts the
delete and the other rejects it (maybe because it became active in the
meantime), the client (ShardSplitTest) gets an error back and assumes that
the delete did not succeed, whereas it actually succeeded on the right
sub-shard. A single miscounted delete like this would account for the
off-by-one doc counts.
We always advise our users to retry update operations upon failure, and they
would be fine if they followed this advice during shard splitting as well.
ShardSplitTest unfortunately doesn't follow that advice; it just counts
successes/failures and ends up with an inconsistent state.
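To illustrate the retry advice, here is a minimal sketch of a client-side
retry wrapper. It is not SolrJ code: the class name RetryingUpdate, the
helper withRetry, and the simulated failure are all hypothetical, and it
assumes the update being retried (such as a delete by id) is safe to repeat.

```java
import java.util.concurrent.Callable;

public class RetryingUpdate {

    // Hypothetical helper: runs op up to maxAttempts times and returns the
    // first successful result. If every attempt fails, the last exception
    // is rethrown to the caller.
    static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated delete: the first attempt fails (as if one sub-shard
        // rejected the request), the second attempt succeeds.
        String result = withRetry(() -> {
            if (++calls[0] == 1) {
                throw new IllegalStateException("sub-shard rejected the request");
            }
            return "deleted";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempt(s)");
    }
}
```

A client that only counts the first error, as the test does, would record the
delete as failed; one that retries would see it succeed and stay consistent.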
I'll start by fixing delete-by-id to route requests to the correct (single)
sub-shard and enabling this test again.
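The difference between the two routing behaviours can be sketched roughly as
follows. This is an illustration under stated assumptions, not the actual
Solr code: the SubShard class, the method names, the sub-shard names and hash
ranges, and the use of String.hashCode() as the hash are all hypothetical
stand-ins.

```java
import java.util.List;

public class SubShardRouting {

    // Hypothetical sub-shard: a name plus the inclusive hash range it owns.
    static final class SubShard {
        final String name;
        final int min;
        final int max;

        SubShard(String name, int min, int max) {
            this.name = name;
            this.min = min;
            this.max = max;
        }

        boolean owns(int hash) {
            return hash >= min && hash <= max;
        }
    }

    // Stand-in hash for illustration only; the real router's hash function
    // is not reproduced here.
    static int hash(String id) {
        return id.hashCode();
    }

    // Behaviour described above for delete-by-id: forward to every sub-shard.
    static List<SubShard> targetsForBroadcast(List<SubShard> subShards, String id) {
        return subShards;
    }

    // Intended behaviour: pick the single sub-shard whose range covers the
    // id's hash, the same way an add request is routed.
    static List<SubShard> targetsForSingle(List<SubShard> subShards, String id) {
        int h = hash(id);
        for (SubShard s : subShards) {
            if (s.owns(h)) {
                return List.of(s);
            }
        }
        throw new IllegalStateException("no sub-shard covers hash " + h);
    }

    public static void main(String[] args) {
        List<SubShard> subs = List.of(
                new SubShard("shard1_0", Integer.MIN_VALUE, -1),
                new SubShard("shard1_1", 0, Integer.MAX_VALUE));
        System.out.println(targetsForBroadcast(subs, "doc-42").size());
        System.out.println(targetsForSingle(subs, "doc-42").get(0).name);
    }
}
```

With single-target routing, a rejection from the sub-shard that does not own
the id can no longer turn a successful delete into an error for the client.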
> Investigate ShardSplitTest failures
> -----------------------------------
>
> Key: SOLR-5309
> URL: https://issues.apache.org/jira/browse/SOLR-5309
> Project: Solr
> Issue Type: Task
> Components: SolrCloud
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Priority: Blocker
>
> Investigate why ShardSplitTest is failing sporadically.
> Some recent failures:
> http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3328/
> http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/7760/
> http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/861/