[ 
https://issues.apache.org/jira/browse/CASSANDRA-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764573#comment-17764573
 ] 

Berenguer Blasi commented on CASSANDRA-17296:
---------------------------------------------

This was marked as a 5.x problem, but it affects all versions. The root 
cause was found by digging through the logs:

{noformat}
Process data_checker:
Traceback (most recent call last):
  File "/home/cassandra/cassandra/cassandra-dtest/upgrade_tests/upgrade_through_versions_test.py", line 133, in data_checker
    actual_val = session.execute(prepared, (key,))[0][0]
  File "/home/cassandra/cassandra/venv/src/cassandra-driver/cassandra/cluster.py", line 2618, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/home/cassandra/cassandra/venv/src/cassandra-driver/cassandra/cluster.py", line 4894, in result
    raise self._final_exception
cassandra.OperationTimedOut: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.2:9042
{noformat}

In short, a client timeout in one of the background processes was enough to 
fail the test. Adding 3 retries plus a graceful stop seems a reasonable fix, 
and 100 repeats of this long, expensive test are green.
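
For illustration only, here is a minimal sketch of the "retries + graceful stop" idea. It is not the actual dtest patch. The names `execute_with_retries` and `data_checker_loop` are hypothetical, the local `OperationTimedOut` class is a stand-in for `cassandra.OperationTimedOut` so the snippet runs without the driver, and reading "3 retries" as one initial attempt plus three more is an assumption.

```python
import time


class OperationTimedOut(Exception):
    """Stand-in for cassandra.OperationTimedOut so the sketch runs without the driver."""


def execute_with_retries(run_query, retries=3, delay=0.0, retry_on=(OperationTimedOut,)):
    """Call run_query(), retrying up to `retries` times on a timeout.

    Returns (True, result) on success, or (False, last_exception) once
    the initial attempt and all retries have timed out.
    """
    last_exc = None
    for _attempt in range(1 + retries):
        try:
            return True, run_query()
        except retry_on as exc:
            last_exc = exc
            time.sleep(delay)
    return False, last_exc


def data_checker_loop(keys, run_query_for, retries=3):
    """Check each key; stop cleanly instead of raising if a key keeps timing out."""
    checked = []
    for key in keys:
        ok, result = execute_with_retries(lambda: run_query_for(key), retries=retries)
        if not ok:
            # Graceful stop: leave the loop and return normally, so the
            # background process does not die with an unhandled traceback.
            break
        checked.append((key, result))
    return checked
```

With something along these lines, a single "Connection defunct by heartbeat" timeout would be retried rather than killing the checker process with exit code 1, which is what makes `_check_on_subprocs` fail the test.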

> Test Failure: 
> dtest-upgrade.upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD.test_rolling_upgrade
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17296
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17296
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Josh McKenzie
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 3.0.30, 4.0.12, 4.1.4, 5.0-alpha2, 5.x
>
>
> 2 failures in 30, looks flaky on timing / subprocess termination.
> https://ci-cassandra.apache.org/job/Cassandra-trunk/920/testReport/dtest-upgrade.upgrade_tests.upgrade_through_versions_test/TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD/test_rolling_upgrade/
> Failed 2 times in the last 30 runs. Flakiness: 10%, Stability: 93%
> Error Message
> RuntimeError: A subprocess has terminated early. Subprocess statuses: Process-1 (is_alive: True), Process-2 (is_alive: False), attempting to terminate remaining subprocesses now.
> Stacktrace
> self = <upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD object at 0x7f22685cebb0>
>     @pytest.mark.timeout(3000)
>     def test_rolling_upgrade(self):
>         """
>             Test rolling upgrade of the cluster, so we have mixed versions part way through.
>             """
> >       self.upgrade_scenario(rolling=True)
> upgrade_tests/upgrade_through_versions_test.py:320: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> upgrade_tests/upgrade_through_versions_test.py:398: in upgrade_scenario
>     self._check_on_subprocs(self.fixture_dtest_setup.subprocs)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <upgrade_tests.upgrade_through_versions_test.TestProtoV4Upgrade_AllVersions_RandomPartitioner_EndsAt_Trunk_HEAD object at 0x7f22685cebb0>
> subprocs = [<Process name='Process-1' pid=28667 parent=314 stopped exitcode=-SIGKILL daemon>, <Process name='Process-2' pid=28686 parent=314 stopped exitcode=1 daemon>]
>     def _check_on_subprocs(self, subprocs):
>         """
>             Check on given subprocesses.
>     
>             If any are not alive, we'll go ahead and terminate any remaining alive subprocesses since this test is going to fail.
>             """
>         subproc_statuses = [s.is_alive() for s in subprocs]
>         if not all(subproc_statuses):
>             message = "A subprocess has terminated early. Subprocess statuses: "
>             for s in subprocs:
>                 message += "{name} (is_alive: {aliveness}), ".format(name=s.name, aliveness=s.is_alive())
>             message += "attempting to terminate remaining subprocesses now."
>             self._terminate_subprocs()
> >           raise RuntimeError(message)
> E           RuntimeError: A subprocess has terminated early. Subprocess statuses: Process-1 (is_alive: True), Process-2 (is_alive: False), attempting to terminate remaining subprocesses now.
> upgrade_tests/upgrade_through_versions_test.py:456: RuntimeError


