[
https://issues.apache.org/jira/browse/CASSANDRA-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764573#comment-17764573
]
Berenguer Blasi commented on CASSANDRA-17296:
---------------------------------------------
This was marked as a 5.x problem, but it affects all versions. The root
cause was found by digging through the logs:
{noformat}
Process data_checker:
Traceback (most recent call last):
  File "/home/cassandra/cassandra/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py", line 133, in data_checker
    actual_val = session.execute(prepared, (key,))[0][0]
  File "/home/cassandra/cassandra/venv/src/cassandra-driver/cassandra/cluster.py", line 2618, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/home/cassandra/cassandra/venv/src/cassandra-driver/cassandra/cluster.py", line 4894, in result
    raise self._final_exception
cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.2:9042
{noformat}
Basically, a client timeout in one of the background processes would blow up
the test. Adding 3 retries plus a graceful stop seems a reasonable solution,
and 100 repeats of this expensive, long-running test are green.
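For illustration only, here is a minimal sketch of the retry plus graceful stop idea. It is not the actual patch: `execute_with_retries`, `checker_loop`, the `stop_event` parameter and the stand-in `OperationTimedOut` class are hypothetical names chosen so the snippet runs without the driver. In the dtest the exception would be `cassandra.OperationTimedOut`, as seen in the traceback above, and the reads would go through `session.execute`.

```python
import threading
import time


class OperationTimedOut(Exception):
    """Stand-in for cassandra.OperationTimedOut so the sketch runs without the driver."""


def execute_with_retries(run_query, attempts=3, delay=0.0, retry_on=(OperationTimedOut,)):
    """Call run_query() up to `attempts` times, retrying only on `retry_on` errors."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except retry_on:
            if attempt == attempts:
                raise  # retries exhausted: surface the last timeout
            time.sleep(delay)


def checker_loop(keys, read_value, stop_event, attempts=3):
    """Read each key with retries, leaving the loop early once stop_event is set."""
    seen = []
    for key in keys:
        if stop_event.is_set():
            break  # graceful stop requested by the parent
        value = execute_with_retries(lambda: read_value(key), attempts=attempts)
        seen.append((key, value))
    return seen


# Simulated read that times out twice before succeeding.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OperationTimedOut("Connection defunct by heartbeat")
    return 42


result = execute_with_retries(flaky_read)
```

With 3 attempts, the two simulated timeouts are absorbed and the third call returns the value. A read that keeps timing out still raises after the last attempt, so real outages are not hidden.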
> Test Failure:
> dtest-upgrade.upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD.test_rolling_upgrade
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17296
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17296
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/python
> Reporter: Josh McKenzie
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 3.0.30, 4.0.12, 4.1.4, 5.0-alpha2, 5.x
>
>
> 2 failures in 30, looks flaky on timing / subprocess termination.
> https://ci-cassandra.apache.org/job/Cassandra-trunk/920/testReport/dtest-upgrade.upgrade_tests.upgrade_through_versions_test/TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD/test_rolling_upgrade/
> Failed 2 times in the last 30 runs. Flakiness: 10%, Stability: 93%
> Error Message
> RuntimeError: A subprocess has terminated early. Subprocess statuses:
> Process-1 (is_alive: True), Process-2 (is_alive: False), attempting to
> terminate remaining subprocesses now.
> Stacktrace
> self = <upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD object at 0x7f22685cebb0>
>     @pytest.mark.timeout(3000)
>     def test_rolling_upgrade(self):
>         """
>         Test rolling upgrade of the cluster, so we have mixed versions
>         part way through.
>         """
> >       self.upgrade_scenario(rolling=True)
> upgrade_tests/upgrade_through_versions_test.py:320:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _
> upgrade_tests/upgrade_through_versions_test.py:398: in upgrade_scenario
> self._check_on_subprocs(self.fixture_dtest_setup.subprocs)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD object at 0x7f22685cebb0>
> subprocs = [<Process name='Process-1' pid=28667 parent=314 stopped exitcode=-SIGKILL daemon>, <Process name='Process-2' pid=28686 parent=314 stopped exitcode=1 daemon>]
>     def _check_on_subprocs(self, subprocs):
>         """
>         Check on given subprocesses.
>
>         If any are not alive, we'll go ahead and terminate any remaining
>         alive subprocesses since this test is going to fail.
>         """
>         subproc_statuses = [s.is_alive() for s in subprocs]
>         if not all(subproc_statuses):
>             message = "A subprocess has terminated early. Subprocess statuses: "
>             for s in subprocs:
>                 message += "{name} (is_alive: {aliveness}), ".format(name=s.name, aliveness=s.is_alive())
>             message += "attempting to terminate remaining subprocesses now."
>             self._terminate_subprocs()
> >           raise RuntimeError(message)
> E           RuntimeError: A subprocess has terminated early. Subprocess statuses: Process-1 (is_alive: True), Process-2 (is_alive: False), attempting to terminate remaining subprocesses now.
> upgrade_tests/upgrade_through_versions_test.py:456: RuntimeError