[
https://issues.apache.org/jira/browse/SOLR-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877320#comment-16877320
]
ASF subversion and git services commented on SOLR-13599:
--------------------------------------------------------
Commit b4a602f6b24196273adbdb7d47bf42fa8d08d807 in lucene-solr's branch
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b4a602f ]
SOLR-13599: additional 'checkpoint' logging to try and help diagnose strange
failures
> ReplicationFactorTest high failure rate on Windows jenkins VMs after
> 2019-06-22 OS/java upgrades
> ------------------------------------------------------------------------------------------------
>
> Key: SOLR-13599
> URL: https://issues.apache.org/jira/browse/SOLR-13599
> Project: Solr
> Issue Type: Bug
>                Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: thetaphi_Lucene-Solr-master-Windows_8025.log.txt
>
>
> We've started seeing some weirdly consistent (but not reliably reproducible)
> failures from ReplicationFactorTest when running on Uwe's Windows jenkins
> machines.
> The failures all seem to have started on June 22 -- when Uwe upgraded the
> Java version on his Windows VMs -- but they happen across all Java versions
> tested, and on both master and branch_8x.
> While this test failed a total of 5 times, in different ways, on various
> jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on
> all but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when
> it fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins
> builds frequently fails anywhere from 1 to 4 additional times.
> All of these failures occur in the exact same place, with the exact same
> assertion: the expected replicationFactor of 2 was not achieved, and rf=1
> (i.e. only the leader) was returned, when sending a _batch_ of documents
> to a collection with 1 shard and 3 replicas, while 1 of the replicas was
> partitioned off due to a closed proxy.
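> For reference, here's a rough SolrJ sketch of the kind of batched update +
> achieved-rf check involved here (the zk host, collection name, doc count, and
> min_rf value are illustrative placeholders, not the test's actual values, and
> the exact location of the "rf" key in the response may differ):
> {code:java}
> import java.util.Collections;
> import java.util.Optional;
>
> import org.apache.solr.client.solrj.impl.CloudSolrClient;
> import org.apache.solr.client.solrj.request.UpdateRequest;
> import org.apache.solr.common.SolrInputDocument;
> import org.apache.solr.common.util.NamedList;
>
> public class RfCheckSketch {
>   public static void main(String[] args) throws Exception {
>     try (CloudSolrClient client = new CloudSolrClient.Builder(
>         Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
>
>       UpdateRequest req = new UpdateRequest();
>       // min_rf asks the cluster to report the achieved replication factor
>       // (newer Solr versions report rf even without it)
>       req.setParam("min_rf", "2");
>       for (int i = 0; i < 15; i++) {
>         SolrInputDocument doc = new SolrInputDocument();
>         doc.addField("id", "doc-" + i);
>         req.add(doc);
>       }
>
>       // "testcollection" is a placeholder, not the test's collection name
>       NamedList<Object> rsp = client.request(req, "testcollection");
>       // the achieved rf is reported in the update response header
>       Object rf = rsp.findRecursive("responseHeader", "rf");
>       System.out.println("achieved rf = " + rf);
>     }
>   }
> }
> {code}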
> In the handful of logs I've examined closely, the 2nd "live" replica does in
> fact log that it received & processed the update, but with a QTime of over 30
> seconds, and then it immediately logs an
> {{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception --
> meanwhile, the leader has one {{updateExecutor}} thread logging copious
> amounts of {{java.net.ConnectException: Connection refused: no further
> information}} regarding the replica that was partitioned off, before a second
> {{updateExecutor}} thread ultimately logs
> {{java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live"
> replica.
> ----
> What makes this perplexing is that this is not the first time in the test
> that documents were added to this collection while one replica was
> partitioned off, but it is the first time that all 3 of the following are
> true _at the same time_:
> # the collection has recovered after some replicas were partitioned and
> re-connected
> # a batch of multiple documents is being added
> # one replica has been "re" partitioned.
> ...prior to the point when this failure happens, only individual document
> adds were tested while replicas were partitioned. Batches of adds were only
> tested when all 3 replicas were "live" after the proxies were re-opened and
> the collection had fully recovered. The failure also comes from the first
> update to happen after a replica's proxy port has been "closed" for the
> _second_ time.
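> In other words, the failing sequence looks roughly like the sketch below.
> The proxy-control helpers are hypothetical stand-ins for the test's actual
> socket-proxy plumbing, and the doc counts are illustrative only:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.solr.common.SolrInputDocument;
>
> public class PartitionScenarioSketch {
>
>   // Hypothetical stand-ins for the test's proxy handling; the real test
>   // opens/closes proxies that sit in front of each replica.
>   static void closeProxyFor(String replica)  { System.out.println("close "  + replica); }
>   static void reopenProxyFor(String replica) { System.out.println("reopen " + replica); }
>
>   static List<SolrInputDocument> batchOf(int n) {
>     List<SolrInputDocument> docs = new ArrayList<>();
>     for (int i = 0; i < n; i++) {
>       SolrInputDocument doc = new SolrInputDocument();
>       doc.addField("id", "doc-" + i);
>       docs.add(doc);
>     }
>     return docs;
>   }
>
>   public static void main(String[] args) {
>     // 1) partition one replica, add *individual* docs, check the reported rf
>     closeProxyFor("replica1");
>     // ... single-doc adds + rf assertions happen here ...
>
>     // 2) heal the partition, wait for full recovery, then add a *batch*
>     reopenProxyFor("replica1");
>     List<SolrInputDocument> batch = batchOf(15);
>     // ... batch add + rf assertion happens here ...
>
>     // 3) "re"-partition the same replica, then send another batch -- this is
>     //    the step where the Windows jenkins runs report rf=1 instead of rf=2
>     closeProxyFor("replica1");
>     System.out.println("sending batch of " + batch.size() + " docs");
>     // ... batch add + rf assertion that fails on Windows ...
>   }
> }
> {code}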
> While this combination of events might conceivably trigger some weird bug,
> what makes these failures _particularly_ perplexing is that:
> * the failures only happen on Windows
> * the failures only started after the Windows VM update on June 22.