[ 
https://issues.apache.org/jira/browse/SOLR-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703853#comment-17703853
 ] 

David Smiley commented on SOLR-11685:
-------------------------------------

This issue was created as specific to 
CollectionsAPIDistributedZkTest.testCollectionsAPI, but nowadays this test 
seems to fail only very rarely.  It has no failures in the last 30 days; I 
can't see it at all on 
http://fucit.org/solr-jenkins-reports/failure-report.html.
Looking at build emails, it last failed about 30 days ago.  There are no logs 
outside of the mail; I don't see the infamous "ClusterState says we are the 
leader".  When I search for that string in my email for other tests showing 
this, I see a more consistent track record of 
org.apache.solr.cloud.BasicDistributedZkTest.test throwing this, so maybe this 
JIRA issue could be generalized to this exception no matter which test or 
production/real-world scenario produces it?

As it happens, I have a test in a fork of Solr that causes this failure half 
the time on a split shard test that is rather simple (notwithstanding the 
inherent complexities of shard splits themselves).  After debugging it, I came 
to a similar conclusion: this error should be caught and retried by the 
caller.  It turns out this is as easy as changing the HTTP status code from 
SERVICE_UNAVAILABLE to INVALID_STATE.
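
To illustrate the retry idea, here is a minimal sketch; it is not Solr's 
actual code.  It assumes the caller treats INVALID_STATE as retryable and 
SERVICE_UNAVAILABLE as not, which is what the status code change above 
implies.  The class RetryOnInvalidState, the StatusException type and the 
callWithRetry helper are hypothetical names, and the numeric codes 503 and 510 
are assumed to match SolrException.ErrorCode.

```java
import java.util.concurrent.Callable;

/** Hypothetical caller-side retry sketch; names here are not Solr APIs. */
final class RetryOnInvalidState {
  // Assumed to mirror SolrException.ErrorCode.SERVICE_UNAVAILABLE and INVALID_STATE.
  static final int SERVICE_UNAVAILABLE = 503;
  static final int INVALID_STATE = 510;

  /** Hypothetical stand-in for an exception that carries an HTTP-style status code. */
  static final class StatusException extends RuntimeException {
    final int code;

    StatusException(int code, String message) {
      super(message);
      this.code = code;
    }
  }

  /** Assumption: only INVALID_STATE signals a transient condition worth retrying. */
  static boolean isRetryable(int code) {
    return code == INVALID_STATE;
  }

  /** Calls the request up to maxAttempts times, retrying only on retryable codes. */
  static <T> T callWithRetry(Callable<T> request, int maxAttempts) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    StatusException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return request.call();
      } catch (StatusException e) {
        if (!isRetryable(e.code)) {
          throw e; // non-retryable errors go straight back to the caller
        }
        last = e; // remember the failure and try again
      }
    }
    throw last; // every attempt failed with a retryable code
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated request: rejected twice with a leader mismatch, then accepted.
    String result = callWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new StatusException(INVALID_STATE,
            "ClusterState says we are the leader, but locally we don't think so");
      }
      return "ok";
    }, 5);
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

A real caller would presumably also wait between attempts so that leader 
state has a chance to settle; that is left out here to keep the sketch short.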

I see another problem based on my test.  A shard that is being split (a 
so-called parent shard), or whose split recently completed (and thus may have 
state INACTIVE), receives docs from a client (the test) and forwards them to 
the sub-shards.  But when a sub-shard fails with the error shown above, the 
failure is *not* bubbled up to the client; it's swallowed as okay.  Changing 
the status code may fix this for the invalid-state case but wouldn't for other 
general errors (e.g. a host going down suddenly).  The result is data loss.
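
The difference between swallowing and surfacing a sub-shard failure can be 
sketched as follows.  This is only an illustration of the principle, not how 
Solr's update forwarding is implemented: SubShardForwarding, 
ForwardingException and forward are hypothetical names, and each sub-shard is 
modeled as a plain Consumer<String>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: a parent shard forwarding a doc to its sub-shards. */
final class SubShardForwarding {

  /** Raised when at least one sub-shard did not accept the forwarded doc. */
  static final class ForwardingException extends RuntimeException {
    final List<RuntimeException> failures;

    ForwardingException(List<RuntimeException> failures) {
      super(failures.size() + " sub-shard(s) rejected the update", failures.get(0));
      this.failures = new ArrayList<>(failures);
    }
  }

  /**
   * Forwards the doc to every sub-shard and collects failures instead of
   * dropping them, so the client is never told "okay" for a lost doc.
   */
  static void forward(String doc, List<Consumer<String>> subShards) {
    List<RuntimeException> failures = new ArrayList<>();
    for (Consumer<String> subShard : subShards) {
      try {
        subShard.accept(doc);
      } catch (RuntimeException e) {
        failures.add(e); // record the failure; do not swallow it
      }
    }
    if (!failures.isEmpty()) {
      throw new ForwardingException(failures);
    }
  }

  public static void main(String[] args) {
    List<String> received = new ArrayList<>();
    List<Consumer<String>> subShards = new ArrayList<>();
    subShards.add(received::add); // healthy sub-shard
    subShards.add(doc -> {        // sub-shard hitting the leader mismatch
      throw new IllegalStateException("locally we don't think we are the leader");
    });
    try {
      forward("doc-1", subShards);
      System.out.println("acknowledged");
    } catch (ForwardingException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Because the failure reaches the client regardless of its cause, this covers 
the invalid-state case as well as general errors such as a host going down.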

> CollectionsAPIDistributedZkTest.testCollectionsAPI fails regularly with 
> leader mismatch
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-11685
>                 URL: https://issues.apache.org/jira/browse/SOLR-11685
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Major
>         Attachments: jenkins_7x_257.log, jenkins_master_7045.log, 
> solr_master_7574.log, solr_master_8983.log
>
>
> I've been noticing lots of failures on Jenkins where the document add gets 
> rejected because of a leader conflict and throws an error like 
> {code}
> ClusterState says we are the leader 
> (https://127.0.0.1:38715/solr/awhollynewcollection_0_shard2_replica_n2), but 
> locally we don't think so. Request came from null
> {code}
> Scanning Jenkins logs, I see that these failures have increased since Sept 
> 28th and have been occurring daily.



