[jira] [Commented] (CASSANDRA-18564) Test Failure: MixedModeAvailabilityV30AllOneTest.testAvailabilityCoordinatorUpgraded

Jira Thu, 24 Aug 2023 11:03:07 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-18564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758678#comment-17758678 ]


Andres de la Peña commented on CASSANDRA-18564:
-----------------------------------------------

The test seems to pass in 4.1: 
[https://app.circleci.com/pipelines/github/adelapena/cassandra/3141/workflows/5e7f85e7-ce66-4e8a-af8d-23f7de9d2205]

It only fails in 5.0 and trunk. This test has always been quite long and prone 
to timeouts. The new v50 and v51 branches have almost doubled the number of 
upgrade paths, which is probably what is causing the timeouts.

[Increasing {{read_request_timeout}} and {{write_request_timeout}} to 30 
seconds|https://github.com/apache/cassandra/commit/d7775660bfe94ab2353faab95a4a7cd0e2cf79ee]
 seems to be enough to make the test survive 500 runs: 
https://app.circleci.com/pipelines/github/adelapena/cassandra/3142/workflows/0c680c43-c988-4fa5-8e05-7ab81aea8ae1

However, increasing the timeouts to one minute seems to risk a CircleCI 
timeout: 
https://app.circleci.com/pipelines/github/adelapena/cassandra/3143/workflows/66ca0ccc-f6a0-4ba6-9376-59b9b03b57fb

The tricky bit with this test is that it runs both queries that should 
succeed and queries that should time out. So, if we increase the timeout 
thresholds for the queries that should succeed, then the queries that should 
time out make the test incredibly slow.
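
As a rough illustration of that trade-off, here is a minimal sketch of the arithmetic. It assumes that every query expected to time out waits for the full request timeout; the class name, the method name, and the counts of upgrade paths and timeout cases are hypothetical placeholders, not numbers taken from the test.

```java
import java.time.Duration;

public final class TimeoutCostSketch {
    /**
     * Lower bound on the time spent waiting for queries that are expected
     * to time out: each of them blocks for the full request timeout.
     */
    public static Duration expectedTimeoutCost(int upgradePaths,
                                               int timeoutCasesPerPath,
                                               Duration requestTimeout) {
        return requestTimeout.multipliedBy((long) upgradePaths * timeoutCasesPerPath);
    }

    public static void main(String[] args) {
        int upgradePaths = 4;         // hypothetical, not the real number of paths
        int timeoutCasesPerPath = 10; // hypothetical, not the real number of cases
        for (long seconds : new long[] { 10, 30, 60 }) {
            Duration cost = expectedTimeoutCost(upgradePaths, timeoutCasesPerPath,
                                                Duration.ofSeconds(seconds));
            System.out.printf("timeout=%ds -> at least %d min waiting on expected timeouts%n",
                              seconds, cost.toMinutes());
        }
    }
}
```

The cost grows linearly with the timeout, so doubling the threshold doubles the time spent on the cases that can never succeed.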

Maybe we can live with those 30s timeouts, even if they make the test quite 
expensive. However, it occurs to me that we could relax the conditions of the 
test to only verify the queries that should succeed with the expected 
consistency levels and down replicas. We could simply skip testing the cases 
where the query should time out. After all, those expected timeouts due to down 
replicas seem indistinguishable from the timeouts produced by an overloaded CI 
environment. wdyt?
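
To make the proposal concrete, here is a minimal sketch of how the cases could be filtered. It is not the code in MixedModeAvailabilityTestBase: the class, enum, and method names are hypothetical, it looks at a single operation rather than a write-read pair, and it assumes every node is a replica (for example, replication factor 3 on a 3-node cluster). The only rule it relies on is that an operation can succeed when the number of live replicas is at least the number the consistency level blocks for.

```java
import java.util.ArrayList;
import java.util.List;

public final class AvailabilityCasesSketch {
    public enum ConsistencyLevel { ONE, QUORUM, ALL }

    /** Number of replica responses the coordinator waits for. */
    public static int blockFor(ConsistencyLevel cl, int replicationFactor) {
        switch (cl) {
            case ONE:    return 1;
            case QUORUM: return replicationFactor / 2 + 1;
            case ALL:    return replicationFactor;
            default:     throw new AssertionError(cl);
        }
    }

    /** True if enough replicas are alive for the operation to succeed. */
    public static boolean shouldSucceed(ConsistencyLevel cl, int replicationFactor, int nodesDown) {
        return replicationFactor - nodesDown >= blockFor(cl, replicationFactor);
    }

    /** Keeps only the cases expected to succeed; expected timeouts are skipped. */
    public static List<String> casesToVerify(int replicationFactor) {
        List<String> cases = new ArrayList<>();
        for (ConsistencyLevel cl : ConsistencyLevel.values())
            for (int down = 0; down < replicationFactor; down++)
                if (shouldSucceed(cl, replicationFactor, down))
                    cases.add(cl + " with " + down + " nodes down");
        return cases;
    }

    public static void main(String[] args) {
        casesToVerify(3).forEach(System.out::println);
    }
}
```

With this kind of filter, a timeout in any remaining case would be a real failure (or CI overload), since no case in the list is expected to time out.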

> Test Failure: 
> MixedModeAvailabilityV30AllOneTest.testAvailabilityCoordinatorUpgraded
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18564
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18564
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Andres de la Peña
>            Assignee: Andres de la Peña
>            Priority: Normal
>             Fix For: 5.0.x, 5.x
>
>
> The JVM upgrade dtest 
> {{MixedModeAvailabilityV3XAllOneTest.testAvailabilityCoordinatorUpgraded}} 
> seems to be flaky at least in {{trunk}}:
> {code}
> junit.framework.AssertionFailedError: Error in test '4.0.11 -> [5.0]' while upgrading to '5.0'; successful upgrades []
>       at org.apache.cassandra.distributed.upgrade.UpgradeTestBase$TestCase.run(UpgradeTestBase.java:348)
>       at org.apache.cassandra.distributed.upgrade.MixedModeAvailabilityTestBase.testAvailability(MixedModeAvailabilityTestBase.java:154)
>       at org.apache.cassandra.distributed.upgrade.MixedModeAvailabilityTestBase.testAvailabilityCoordinatorUpgraded(MixedModeAvailabilityTestBase.java:74)
> Caused by: java.lang.AssertionError: Unexpected error while reading in case write-read consistency ALL-ONE with upgraded coordinator and 2 nodes down: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>       at org.apache.cassandra.distributed.upgrade.MixedModeAvailabilityTestBase.lambda$testAvailability$6(MixedModeAvailabilityTestBase.java:145)
>       at org.apache.cassandra.distributed.upgrade.UpgradeTestBase$TestCase.run(UpgradeTestBase.java:339)
> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>       at org.apache.cassandra.service.reads.ReadCallback.awaitResults(ReadCallback.java:162)
>       at org.apache.cassandra.service.reads.AbstractReadExecutor.awaitResponses(AbstractReadExecutor.java:387)
>       at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:2124)
>       at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1995)
>       at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1873)
>       at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1286)
>       at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:364)
>       at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:293)
>       at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:105)
>       at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:122)
>       at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:103)
>       at org.apache.cassandra.distributed.impl.Coordinator.lambda$executeWithResult$0(Coordinator.java:66)
>       at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>       at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.lang.Thread.run(Thread.java:750)
> {code}
> This has failed 143 times in 500 iterations of this CircleCI run:
> * 
> https://app.circleci.com/pipelines/github/adelapena/cassandra/2927/workflows/fcd1cd60-826b-484a-8e81-d3ba640f7de9/jobs/47659/tests
> The failure has also recently appeared on Jenkins:
> * 
> https://ci-cassandra.apache.org/job/Cassandra-trunk/1585/testReport/org.apache.cassandra.distributed.upgrade/MixedModeAvailabilityV3XAllOneTest/testAvailabilityCoordinatorUpgraded__jdk11/
> Given that the failure has just appeared on Jenkins and it fails relatively 
> easily on CircleCI, it's likely that it has been broken by a very recent 
> change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

