[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768255#comment-17768255
 ] 

Cameron Zemek commented on CASSANDRA-18866:
-------------------------------------------

{noformat}
pytest --count=500 --cassandra-dir=/home/grom/dev/cassandra-instaclustr transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup
{noformat}
500/500 passes.

 
{noformat}
$ rg 'Resending'
1695355675403_test_move_forwards_between_and_cleanup[27-500]/node4_debug.log
1263:DEBUG [InternalResponseStage:1] 2023-09-22 14:07:06,461 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000

1695362768506_test_move_forwards_between_and_cleanup[74-500]/node1_debug.log
1038:DEBUG [InternalResponseStage:1] 2023-09-22 16:05:20,772 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695362768506_test_move_forwards_between_and_cleanup[74-500]/node1_debug.log: WARNING: stopped searching binary file after match (found "\0" byte around offset 329646)

1695403170261_test_move_forwards_between_and_cleanup[342-500]/node1_debug.log
1029:DEBUG [InternalResponseStage:1] 2023-09-23 03:18:41,126 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
1695403170261_test_move_forwards_between_and_cleanup[342-500]/node1_debug.log: WARNING: stopped searching binary file after match (found "\0" byte around offset 331373)

1695366089957_test_move_forwards_between_and_cleanup[96-500]/node4_debug.log
1275:DEBUG [InternalResponseStage:1] 2023-09-22 17:00:41,140 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000

1695422554318_test_move_forwards_between_and_cleanup[471-500]/node4_debug.log
1293:DEBUG [InternalResponseStage:1] 2023-09-23 08:41:45,750 Gossiper.java:1390 - Resending ECHO_REQ to /127.0.0.2:7000
{noformat}
So the retry happens in 5 of the 500 runs, i.e. about 1% of the time with this test.
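
For anyone wanting to redo the count, here is a minimal sketch of how the figure can be derived from the matching log paths. It assumes the bracketed suffix in each log directory name (e.g. [27-500]) is the run index and total run count produced by --count=500; the helper name retry_rate is hypothetical and not part of the dtest tooling.

```python
import re

# Assumption: the directory suffix "[27-500]" means "run 27 of 500".
RUN_SUFFIX = re.compile(r"\[(\d+)-(\d+)\]")


def retry_rate(matching_paths):
    """Return (runs_with_retry, total_runs, fraction) for logs that matched 'Resending'."""
    runs = set()
    total = None
    for path in matching_paths:
        m = RUN_SUFFIX.search(path)
        if m is None:
            continue  # not a per-run log directory, ignore it
        runs.add(int(m.group(1)))  # a run counts once even if several lines matched
        total = int(m.group(2))
    if not total:
        raise ValueError("no per-run log paths found")
    return len(runs), total, len(runs) / total


# Paths copied from the rg output above.
paths = [
    "1695355675403_test_move_forwards_between_and_cleanup[27-500]/node4_debug.log",
    "1695362768506_test_move_forwards_between_and_cleanup[74-500]/node1_debug.log",
    "1695403170261_test_move_forwards_between_and_cleanup[342-500]/node1_debug.log",
    "1695366089957_test_move_forwards_between_and_cleanup[96-500]/node4_debug.log",
    "1695422554318_test_move_forwards_between_and_cleanup[471-500]/node4_debug.log",
]

hits, total, fraction = retry_rate(paths)
print(f"{hits}/{total} runs retried ({fraction:.1%})")
```

Counting distinct run indexes rather than matching lines keeps a run from being counted twice if more than one node in it logs a resend.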

> Node sends multiple inflight echos
> ----------------------------------
>
>                 Key: CASSANDRA-18866
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had a change to allow only 1 inflight ECHO request at a time. As per 
> 18854, some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. The current state also does not have retry logic; 
> it just allows multiple ECHO requests inflight at the same time, so it is less 
> likely that all ECHOs will time out or get lost.
> With the change from 18845, adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes, there is no retry logic for failed ECHO requests, but this is the 
> case both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I am therefore 
> assuming there are cases where it won't call markAlive when other nodes consider 
> the node UP but it's marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
