Hoss Man created SOLR-13599:
-------------------------------
Summary: ReplicationFactorTest high failure rate on Windows
jenkins VMs after 2019-06-22 OS/java upgrades
Key: SOLR-13599
URL: https://issues.apache.org/jira/browse/SOLR-13599
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Hoss Man
We've started seeing some weirdly consistent (but not reliably reproducible)
failures from ReplicationFactorTest when running on Uwe's Windows jenkins
machines.
The failures all seem to have started on June 22 -- when Uwe upgraded the OS
and Java versions on his Windows VMs -- but they happen across all versions of
java tested, and on both master and branch_8x.
While this test failed a total of 5 times, in different ways, on various
jenkins boxes between 2019-01-01 and 2019-06-21, it seems to have failed on all
but 1 or 2 of Uwe's "Windows" jenkins builds since 2019-06-22, and when it
fails, the {{reproduceJenkinsFailures.py}} logic used in Uwe's jenkins builds
frequently reproduces the failure anywhere from 1-4 additional times.
All of these failures occur in the exact same place, with the exact same
assertion: that the expected replicationFactor of 2 was not achieved, and an
rf=1 (ie: only the leader) was returned, when sending a _batch_ of documents to
a collection with 1 shard and 3 replicas, while 1 of the replicas was
partitioned off due to a closed proxy.
In the handful of logs I've examined closely, the 2nd "live" replica does in
fact log that it received & processed the update, but with a QTime of over 30
seconds, and it then immediately logs an
{{org.eclipse.jetty.io.EofException: Reset cancel_stream_error}} Exception --
meanwhile, the leader has one {{updateExecutor}} thread logging copious amounts
of {{java.net.ConnectException: Connection refused: no further information}}
regarding the replica that was partitioned off, before a second
{{updateExecutor}} thread ultimately logs
{{java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: idle_timeout}} regarding the "live"
replica.
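
To make the suspected sequence concrete, here is a rough Python model (not Solr
code) of how an achieved rf of 1 could fall out of the logs described above.
The names {{ReplicaOutcome}} and {{achieved_rf}}, and the 30 second idle
timeout, are hypothetical and chosen only for illustration; the real timeout
value has not been confirmed from the test configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaOutcome:
    """Hypothetical record of what happened when the leader forwarded a batch."""
    name: str
    reachable: bool                  # False models the closed proxy (Connection refused)
    response_secs: Optional[float]   # how long the replica took to respond, if it did


def achieved_rf(outcomes: List[ReplicaOutcome], idle_timeout_secs: float) -> int:
    """Count the leader plus every replica that answered before the idle timeout."""
    rf = 1  # the leader always counts itself
    for outcome in outcomes:
        if not outcome.reachable:
            continue  # partitioned replica: never acknowledged
        if outcome.response_secs is not None and outcome.response_secs < idle_timeout_secs:
            rf += 1   # acknowledged in time
        # otherwise the leader gave up first (idle_timeout), so no credit is given
    return rf


# Assumed idle timeout; the value is illustrative only.
IDLE_TIMEOUT_SECS = 30.0

partitioned = ReplicaOutcome("partitioned", reachable=False, response_secs=None)

# Expected case: the live replica answers quickly, so rf == 2.
healthy = achieved_rf(
    [partitioned, ReplicaOutcome("live", reachable=True, response_secs=0.5)],
    IDLE_TIMEOUT_SECS,
)

# Failing case: the live replica does process the update, but takes longer than
# the leader is willing to wait, so only the leader is counted.
observed = achieved_rf(
    [partitioned, ReplicaOutcome("live", reachable=True, response_secs=31.0)],
    IDLE_TIMEOUT_SECS,
)

print(healthy, observed)
```

In this model the "live" replica's work is not lost; the leader simply stops
waiting before the response arrives, which would be consistent with the replica
logging a successful update followed by an EofException.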