[ https://issues.apache.org/jira/browse/SOLR-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703853#comment-17703853 ]
David Smiley commented on SOLR-11685:
-------------------------------------

This issue was created as specific to CollectionsAPIDistributedZkTest.testCollectionsAPI, but nowadays this test seems to fail only very rarely. There are no failures in the last 30 days – it doesn't appear at all on http://fucit.org/solr-jenkins-reports/failure-report.html. Looking at build emails, it failed about 30 days ago. There are no logs outside of the mail, and I don't see the infamous "ClusterState says we are the leader" in it.

When I search my email for that string in other tests, I see a more consistent track record of org.apache.solr.cloud.BasicDistributedZkTest.test throwing it. So maybe this JIRA issue could be generalized to cover this exception no matter which test or production/real-world scenario produces it?

As it happens, I have a test in a fork of Solr that hits this failure half the time. It is a split shard test that is rather simple (notwithstanding the inherent complexities of shard splits themselves). After debugging it, I came to a similar conclusion: this error should be caught and retried by the caller. It turns out this is as easy as changing the HTTP status code from SERVICE_UNAVAILABLE to INVALID_STATE.

I see another problem based on my test. A shard that is being split (a so-called parent shard), or one whose split recently completed (and thus may have state INACTIVE), receives docs from a client (the test) and forwards them to the sub-shards. But when a sub-shard fails with the error shown above, the parent does *not* bubble this up to the client; the failure is swallowed as if the update succeeded. Changing the status code may fix the invalid-state case, but it would not help with other general errors (e.g. a host suddenly going down). The result is data loss.
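To illustrate why the choice of status code matters to a caller, here is a minimal, self-contained Java sketch. It is not Solr code: the class `RetryOnInvalidState`, the exception `StatusException`, and the method `callWithRetry` are hypothetical names made up for this illustration. The only details taken from Solr are the numeric values of the two error codes (503 for SERVICE_UNAVAILABLE, 510 for INVALID_STATE). The sketch assumes a caller that retries only when it sees INVALID_STATE and treats everything else as final.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Solr.
class RetryOnInvalidState {
    // Numeric values match SolrException.ErrorCode.
    static final int SERVICE_UNAVAILABLE = 503;
    static final int INVALID_STATE = 510;

    // Stand-in for an exception that carries an HTTP-style status code.
    static class StatusException extends RuntimeException {
        final int code;

        StatusException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    // Runs the request, retrying only while the failure is INVALID_STATE.
    // Any other status code is rethrown to the caller immediately.
    static <T> T callWithRetry(Supplier<T> request, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        StatusException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (StatusException e) {
                if (e.code != INVALID_STATE) {
                    throw e; // not a stale-state error: do not retry
                }
                last = e;
                // A real client would presumably refresh its view of the
                // cluster state here before the next attempt (assumption).
            }
        }
        throw last; // retries exhausted
    }
}
```

With a caller like this, a leader-mismatch rejection reported as 503 surfaces as a hard failure on the first attempt, while the same rejection reported as 510 is retried. Note that this only addresses the retry path; it does nothing about the separate problem of a sub-shard failure being swallowed by the parent shard.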

> CollectionsAPIDistributedZkTest.testCollectionsAPI fails regularly with leader mismatch
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-11685
>                 URL: https://issues.apache.org/jira/browse/SOLR-11685
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Major
>         Attachments: jenkins_7x_257.log, jenkins_master_7045.log, solr_master_7574.log, solr_master_8983.log
>
>
> I've been noticing lots of failures on Jenkins where the document add gets
> rejected because of leader conflict and throws an error like
> {code}
> ClusterState says we are the leader (https://127.0.0.1:38715/solr/awhollynewcollection_0_shard2_replica_n2), but locally we don't think so. Request came from null
> {code}
> Scanning Jenkins logs I see that these failures have increased since Sept
> 28th and have been occurring daily.