Caleb Rackliffe created CASSANDRA-18347:
-------------------------------------------

             Summary: CEP-21: Startup failures in Python dtests around TCM_REPLAY_REQ
                 Key: CASSANDRA-18347
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18347
             Project: Cassandra
          Issue Type: Bug
            Reporter: Caleb Rackliffe


There are currently widespread, locally reproducible failures in the Python 
dtests against the {{cep-21-tcm}} branch. For example...
 
{noformat}pytest --cassandra-dir=/Users/maedhroz/Forks/cassandra topology_test.py::TestTopology::test_decommissioned_node_cant_rejoin{noformat}

{noformat}pytest --cassandra-dir=/Users/maedhroz/Forks/cassandra materialized_views_test.py::TestMaterializedViews::test_query_new_column{noformat}

{noformat}pytest --cassandra-dir=/Users/maedhroz/Forks/cassandra read_repair_test.py::TestSpeculativeReadRepair::test_normal_read_repair{noformat}

https://app.circleci.com/pipelines/github/maedhroz/cassandra/701/workflows/44a5c7e0-0de0-4839-bbd0-80771fe23843/jobs/7251

https://app.circleci.com/pipelines/github/beobal/cassandra/406/workflows/00cdb02e-4b3e-477a-b997-403121172249/jobs/4204/tests

The death spiral in the node startup logs starts like this…

{noformat}
WARN  [Messaging-EventLoop-3-1] 2023-03-17 11:55:34,037 NoSpamLogger.java:108 - /127.0.0.2:7000->/127.0.0.1:7000-SMALL_MESSAGES-[no-channel] dropping message of type TCM_REPLAY_REQ whose timeout expired before reaching the network
ERROR [InternalResponseStage:3] 2023-03-17 11:55:34,038 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.1:7000], checkLive=false}
INFO  [Messaging-EventLoop-3-12] 2023-03-17 11:55:34,099 InboundConnectionInitiator.java:567 - /127.0.0.2:7000(/127.0.0.2:49763)->/127.0.0.2:7000-SMALL_MESSAGES-1b9301b6 messaging connection established, version = 13, framing = CRC, encryption = unencrypted
INFO  [Messaging-EventLoop-3-9] 2023-03-17 11:55:34,099 OutboundConnection.java:1164 - /127.0.0.2:7000(/127.0.0.2:49763)->/127.0.0.2:7000-SMALL_MESSAGES-a9302b2e successfully connected, version = 13, framing = CRC, encryption = unencrypted
WARN  [InternalMetadataStage:5] 2023-03-17 11:55:34,100 NoSpamLogger.java:108 - Not currently a member of the CMS
INFO  [Messaging-EventLoop-3-13] 2023-03-17 11:55:34,102 InboundConnectionInitiator.java:567 - /127.0.0.2:7000(/127.0.0.2:49764)->/127.0.0.2:7000-URGENT_MESSAGES-f887f6fa messaging connection established, version = 13, framing = CRC, encryption = unencrypted
INFO  [Messaging-EventLoop-3-11] 2023-03-17 11:55:34,102 OutboundConnection.java:1164 - /127.0.0.2:7000(/127.0.0.2:49764)->/127.0.0.2:7000-URGENT_MESSAGES-5cd0c637 successfully connected, version = 13, framing = CRC, encryption = unencrypted
ERROR [InternalResponseStage:4] 2023-03-17 11:55:49,237 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false}
WARN  [InternalMetadataStage:8] 2023-03-17 11:55:49,394 NoSpamLogger.java:108 - Not currently a member of the CMS
WARN  [Messaging-EventLoop-3-1] 2023-03-17 11:56:04,636 NoSpamLogger.java:108 - /127.0.0.2:7000->/127.0.0.1:7000-SMALL_MESSAGES-[no-channel] dropping message of type TCM_REPLAY_REQ whose timeout expired before reaching the network
ERROR [InternalResponseStage:5] 2023-03-17 11:56:04,637 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false}
WARN  [InternalMetadataStage:11] 2023-03-17 11:56:04,892 NoSpamLogger.java:108 - Not currently a member of the CMS
...
ERROR [InternalResponseStage:6] 2023-03-17 11:56:20,335 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.1:7000], checkLive=false}
WARN  [InternalMetadataStage:14] 2023-03-17 11:56:20,391 NoSpamLogger.java:108 - Not currently a member of the CMS
ERROR [InternalResponseStage:7] 2023-03-17 11:56:21,750 RemoteProcessor.java:164 - Got error from /127.0.0.3:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.3:7000], checkLive=false}
WARN  [Messaging-EventLoop-3-1] 2023-03-17 11:56:35,535 NoSpamLogger.java:108 - /127.0.0.2:7000->/127.0.0.1:7000-SMALL_MESSAGES-[no-channel] dropping message of type TCM_REPLAY_REQ whose timeout expired before reaching the network
ERROR [InternalResponseStage:8] 2023-03-17 11:56:35,537 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false}
WARN  [InternalMetadataStage:17] 2023-03-17 11:56:35,693 NoSpamLogger.java:108 - Not currently a member of the CMS
ERROR [InternalResponseStage:9] 2023-03-17 11:56:37,135 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false}
WARN  [InternalMetadataStage:20] 2023-03-17 11:56:37,540 NoSpamLogger.java:108 - Not currently a member of the CMS
ERROR [InternalResponseStage:10] 2023-03-17 11:56:50,935 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false}
WARN  [InternalMetadataStage:23] 2023-03-17 11:56:51,191 NoSpamLogger.java:108 - Not currently a member of the CMS
{noformat}

...and ends here:

{noformat}
ERROR [InternalResponseStage:11] 2023-03-17 11:56:53,036 RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false}
Exception (java.lang.IllegalStateException) encountered during startup: Could not succeed sending TCM_REPLAY_REQ to CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false} after 10 tries
ERROR [main] 2023-03-17 11:56:53,546 CassandraDaemon.java:929 - Exception encountered during startup
java.lang.IllegalStateException: Could not succeed sending TCM_REPLAY_REQ to CandidateIterator{candidates=[/127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.3:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.1:7000, /127.0.0.2:7000, /127.0.0.3:7000, /127.0.0.1:7000], checkLive=false} after 10 tries
        at org.apache.cassandra.tcm.RemoteProcessor.sendWithCallback(RemoteProcessor.java:181)
        at org.apache.cassandra.tcm.RemoteProcessor.replayAndWait(RemoteProcessor.java:118)
        at org.apache.cassandra.tcm.ClusterMetadataService$SwitchableProcessor.replayAndWait(ClusterMetadataService.java:577)
        at org.apache.cassandra.tcm.Startup.initializeForDiscovery(Startup.java:149)
        at org.apache.cassandra.tcm.Startup.initialize(Startup.java:84)
        at org.apache.cassandra.tcm.Startup.initialize(Startup.java:59)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:267)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:777)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:907)
...
{noformat}
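
To see at a glance which peers the starting node is timing out against, the retry errors in a node log can be tallied per peer. What follows is only a rough Python sketch, not part of the dtest suite: the helper name {{count_failures}} and the regular expression are assumptions based solely on the "Got error from ... when sending ..." message format in the excerpts above. Whitespace is collapsed first because pasted log lines may be hard-wrapped.

```python
import re
from collections import Counter

# Assumed message shape, taken from the excerpts in this ticket:
#   "Got error from /127.0.0.1:7000: TIMEOUT when sending TCM_REPLAY_REQ, ..."
ERROR_RE = re.compile(
    r"Got error from /(?P<peer>[\d.]+:\d+): (?P<reason>\w+) when sending (?P<verb>\w+)"
)


def count_failures(log_text):
    """Return a Counter keyed by (peer, reason, verb) for each retry error."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter(
        (m["peer"], m["reason"], m["verb"]) for m in ERROR_RE.finditer(flat)
    )


# Two entries copied from the startup log above, still hard-wrapped.
sample = """\
ERROR [InternalResponseStage:3] 2023-03-17 11:55:34,038
RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending
TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000,
/127.0.0.1:7000], checkLive=false}
ERROR [InternalResponseStage:6] 2023-03-17 11:56:20,335
RemoteProcessor.java:164 - Got error from /127.0.0.1:7000: TIMEOUT when sending
TCM_REPLAY_REQ, retrying on CandidateIterator{candidates=[/127.0.0.2:7000,
/127.0.0.1:7000], checkLive=false}
"""

if __name__ == "__main__":
    for (peer, reason, verb), n in count_failures(sample).most_common():
        print(f"{peer} {reason} {verb} x{n}")
```

The same function can be pointed at the full contents of a node's log file instead of {{sample}}.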


