[ 
https://issues.apache.org/jira/browse/CASSANDRA-21189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061011#comment-18061011
 ] 

Sam Lightfoot edited comment on CASSANDRA-21189 at 2/25/26 5:20 PM:
--------------------------------------------------------------------

The error that triggers the chain of port errors comes from a Paxos commit 
that times out:

 
{code:java}
Caused an ERROR
[2026-02-25T09:27:27.026Z] [junit-timeout] java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Can not commit transformation: "SERVER_ERROR"(Could not perform commit; policy Retry{remainingMs=0, attempts=2} gave up).
{code}
This timeout is set on the cluster builder to 1 second, overriding the 10s 
default:

 

 
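{code:java}
try (Cluster cluster = builder().withNodes(3)
                                .appendConfig(cfg -> cfg.set("progress_barrier_timeout", "5000ms")
                                                        .set("request_timeout", "1000ms") // the 1 second override discussed here
                                                        .set("progress_barrier_backoff", "100ms")
                                // remaining builder calls and the try body are omitted from this excerpt
{
{code}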
The request_timeout effectively becomes the ceiling for the entire Paxos 
commit. Because the failure comes back as a successfully delivered error 
response, the commit is not retried within the cms_await_timeout budget, 
which is significantly larger.
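
To illustrate the point above, here is a minimal toy model of a retry loop bounded by a deadline. It is not Cassandra's Retry implementation: the class name, the 600ms cost per attempt, and the need for three attempts are hypothetical values, chosen only to show how a 1000ms ceiling can run out after two attempts while a 10s ceiling leaves room to finish.

```java
/** Toy model of a deadline-bounded retry loop; NOT Cassandra's Retry class. */
final class RetryBudgetSketch {

    /** Result of one simulated commit. */
    static final class Outcome {
        final boolean success;
        final int attempts;
        final long remainingMs;

        Outcome(boolean success, int attempts, long remainingMs) {
            this.success = success;
            this.attempts = attempts;
            this.remainingMs = remainingMs;
        }

        @Override
        public String toString() {
            return "Outcome{success=" + success + ", attempts=" + attempts
                   + ", remainingMs=" + remainingMs + "}";
        }
    }

    /**
     * Simulates a commit that succeeds on attempt number attemptsNeeded,
     * where each attempt takes attemptCostMs (hypothetical values) and the
     * whole loop must finish within deadlineMs.
     */
    static Outcome commit(long deadlineMs, long attemptCostMs, int attemptsNeeded) {
        long remainingMs = deadlineMs;
        int attempts = 0;
        while (remainingMs > 0) {
            attempts++;
            // An attempt only completes if it fits in the time that is left.
            boolean completed = attemptCostMs <= remainingMs;
            remainingMs -= Math.min(attemptCostMs, remainingMs);
            if (completed && attempts >= attemptsNeeded)
                return new Outcome(true, attempts, remainingMs);
        }
        // Budget exhausted: the caller receives an error and does not retry.
        return new Outcome(false, attempts, remainingMs);
    }

    public static void main(String[] args) {
        System.out.println("1s ceiling:  " + commit(1_000, 600, 3));
        System.out.println("10s ceiling: " + commit(10_000, 600, 3));
    }
}
```

Under these assumed numbers the 1 second ceiling is exhausted during the second attempt, which has the same shape as the Retry{remainingMs=0, attempts=2} state in the log, while the larger ceiling allows the third attempt to complete.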

I think a fairly safe option is to increase the 1000ms request_timeout in the 
three tests where it is set, or remove the override completely, given the 
resource constraints of CI.

 


was (Author: JIRAUSER302824):
Appears to be due to FailedBootstrapTest not cleaning up properly, with 
InProgressSequenceCoordinationTest starting immediately after. Running these 
two tests sequentially reproduces the issue.

> Fix flaky DTest: InProgressSequenceCoordinationTest
> ---------------------------------------------------
>
>                 Key: CASSANDRA-21189
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21189
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 5.1
>
>
> There's a race condition between cluster closing and startup between test 
> scenarios due to lack of thread lifecycle handling. The spawned thread should 
> be joined before the test finishes to prevent the 'in-use port' errors.
> Affects
>  * bootstrapProgressTest
>  * decommissionProgressTest
>  * replacementProgressTest
> Adopt the same pattern as GossipTest with try-finally thread joining.
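
The try-finally thread joining mentioned in the description could look roughly like the sketch below. It is a generic illustration only, not the code from GossipTest or from the affected tests: the class name, method name, and join timeout parameter are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Generic sketch of joining a spawned thread in a finally block. */
final class JoinInFinallySketch {

    /**
     * Starts background on its own thread, runs testBody on the caller's
     * thread, and always waits for the background thread before returning.
     * Returns true if the background thread has terminated by then.
     * Note that a joinTimeoutMs of 0 means "wait forever" for Thread.join.
     */
    static boolean runAndJoin(Runnable background, Runnable testBody, long joinTimeoutMs) {
        Thread worker = new Thread(background, "sketch-worker");
        worker.start();
        try {
            testBody.run();
        } finally {
            // Join even when testBody throws, so the thread cannot outlive
            // the test and keep resources (such as ports) in use.
            try {
                worker.join(joinTimeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        AtomicBoolean finished = new AtomicBoolean(false);
        boolean joined = runAndJoin(() -> finished.set(true), () -> { }, 5_000);
        System.out.println("joined=" + joined + ", finished=" + finished.get());
    }
}
```

Placing the join in the finally block means the wait happens on both the success and failure paths, which is the property the description asks for before the next scenario starts its cluster.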



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
