[ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062703#comment-18062703
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18866 at 3/4/26 1:40 PM:
-----------------------------------------------------------------------

4.0 and 4.1 seem suspicious

[Circle 
4.0|https://app.circleci.com/pipelines/github/instaclustr/cassandra/6276/workflows/108a04f6-ae54-405a-8d27-3aca24428e89]
[Circle 
4.1|https://app.circleci.com/pipelines/github/instaclustr/cassandra/6277/workflows/06d914f9-ae58-456e-bba9-6b8cbc0de6ce]

It is basically the same set of tests in both runs. That tells me this is not 
just tests being flaky but that something else is going on.

There are some failures like that for 5.0 too 
https://pre-ci.cassandra.apache.org/job/cassandra-5.0/97/

I will restart 4.0 and 4.1 to see whether we just had bad luck or the failures 
are more consistent.

5.0 looks fine: https://pre-ci.cassandra.apache.org/job/cassandra-5.0/97/ 



> Node sends multiple inflight echos
> ----------------------------------
>
>                 Key: CASSANDRA-18866
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 18866-regression.patch, CASSANDRA-18866-4.0.patch, 
> CASSANDRA-18866-4.1.patch, CASSANDRA-18866-5.0.patch, duplicates.log, echo.log
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had a change to allow only 1 inflight ECHO request at a time. As per 
> 18854, some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. The current state also does not have retry logic; 
> it just allows multiple ECHO requests inflight at the same time, so it is 
> less likely that all ECHOs will time out or get lost.
> With the change from 18845 and some extra logging added to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes, there is no retry logic for failed ECHO requests, but this is 
> the case both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I am therefore 
> assuming there are cases where it won't call markAlive when other nodes 
> consider the node UP but it's marked DOWN by a node.
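
For readers unfamiliar with the "only 1 inflight ECHO at a time" idea discussed in the description, the following is a minimal sketch of such a guard. It is an illustration under stated assumptions, not the code from 18845 or from Cassandra's gossip implementation: the class EchoGuard and the methods tryBegin and complete are hypothetical names, and endpoints are represented as plain strings for simplicity.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; these names do not exist in Cassandra.
final class EchoGuard {
    // Endpoints that currently have an ECHO request awaiting a reply.
    private final Set<String> inflight = ConcurrentHashMap.newKeySet();

    // Returns true if the caller may send an ECHO to this endpoint,
    // false if one is already in flight for it.
    boolean tryBegin(String endpoint) {
        return inflight.add(endpoint);
    }

    // Must be called on reply, failure, or timeout so that a later
    // gossip round is able to send a fresh ECHO to the same endpoint.
    void complete(String endpoint) {
        inflight.remove(endpoint);
    }
}
```

In this sketch a second ECHO to the same endpoint is only possible after complete has been called, so any timeout or failure path has to release the slot; otherwise no further ECHO would be sent to that endpoint.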



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
