[
https://issues.apache.org/jira/browse/CASSANDRA-18564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758678#comment-17758678
]
Andres de la Peña commented on CASSANDRA-18564:
-----------------------------------------------
The test seems to pass in 4.1:
[https://app.circleci.com/pipelines/github/adelapena/cassandra/3141/workflows/5e7f85e7-ce66-4e8a-af8d-23f7de9d2205]
It only fails in 5.0 and trunk. This test has always been quite long and prone
to timeouts. The new v50 and v51 branches have almost doubled the number of
upgrade paths, and this is probably what is producing the timeouts.
[Increasing {{read_request_timeout}} and {{write_request_timeout}} to 30
seconds|https://github.com/apache/cassandra/commit/d7775660bfe94ab2353faab95a4a7cd0e2cf79ee]
seems to be enough to make the test survive 500 runs:
https://app.circleci.com/pipelines/github/adelapena/cassandra/3142/workflows/0c680c43-c988-4fa5-8e05-7ab81aea8ae1
However, increasing the timeouts to one minute seems to risk a CircleCI
timeout:
https://app.circleci.com/pipelines/github/adelapena/cassandra/3143/workflows/66ca0ccc-f6a0-4ba6-9376-59b9b03b57fb
The tricky bit with this test is that it runs both queries that should
succeed and queries that should time out. So, if we increase the timeout
thresholds for the queries that should succeed, then the queries that should
time out make the test incredibly slow.
Maybe we can live with those 30s timeouts, even if they make the test quite
expensive. However, it occurs to me that we could relax the conditions of the
test to only verify the queries that should succeed with the expected
consistency levels and down replicas. We could simply skip testing the cases
where the query should time out. After all, those expected timeouts due to down
replicas seem indistinguishable from the timeouts produced by an overloaded CI
environment. wdyt?
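To make the idea concrete, here is a rough sketch of the filtering, not the
actual test code. The {{Level}} enum, the replication factor of 3 and the
method names are all made up for illustration; the real logic lives in
{{MixedModeAvailabilityTestBase}} and may well differ.

```java
import java.util.ArrayList;
import java.util.List;

public class AvailabilitySketch
{
    // Hypothetical stand-in for Cassandra's ConsistencyLevel; only a few levels are modelled.
    public enum Level { ONE, QUORUM, ALL }

    // Assumption: 3 replicas, all of them owning the tested partition.
    public static final int REPLICATION_FACTOR = 3;

    // Number of replicas that must respond for the given level.
    public static int requiredReplicas(Level level)
    {
        switch (level)
        {
            case ONE:    return 1;
            case QUORUM: return REPLICATION_FACTOR / 2 + 1;
            default:     return REPLICATION_FACTOR; // ALL
        }
    }

    // A query is expected to succeed if enough replicas are still up.
    public static boolean shouldSucceed(Level level, int nodesDown)
    {
        return REPLICATION_FACTOR - nodesDown >= requiredReplicas(level);
    }

    // Lists only the cases worth running; the expected-timeout cases are skipped
    // instead of being asserted.
    public static List<String> casesToVerify(Level writeLevel, Level readLevel)
    {
        List<String> cases = new ArrayList<>();
        for (int nodesDown = 0; nodesDown < REPLICATION_FACTOR; nodesDown++)
        {
            if (shouldSucceed(writeLevel, nodesDown))
                cases.add("write " + writeLevel + " with " + nodesDown + " nodes down");
            if (shouldSucceed(readLevel, nodesDown))
                cases.add("read " + readLevel + " with " + nodesDown + " nodes down");
        }
        return cases;
    }
}
```

With something like this, the ALL-ONE case would still check the read at ONE
with 2 nodes down, but it would no longer wait for the write at ALL to time
out.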
> Test Failure:
> MixedModeAvailabilityV30AllOneTest.testAvailabilityCoordinatorUpgraded
> ------------------------------------------------------------------------------------
>
> Key: CASSANDRA-18564
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18564
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/java
> Reporter: Andres de la Peña
> Assignee: Andres de la Peña
> Priority: Normal
> Fix For: 5.0.x, 5.x
>
>
> The JVM upgrade dtest
> {{MixedModeAvailabilityV3XAllOneTest.testAvailabilityCoordinatorUpgraded}}
> seems to be flaky at least in {{trunk}}:
> {code}
> junit.framework.AssertionFailedError: Error in test '4.0.11 -> [5.0]' while upgrading to '5.0'; successful upgrades []
>     at org.apache.cassandra.distributed.upgrade.UpgradeTestBase$TestCase.run(UpgradeTestBase.java:348)
>     at org.apache.cassandra.distributed.upgrade.MixedModeAvailabilityTestBase.testAvailability(MixedModeAvailabilityTestBase.java:154)
>     at org.apache.cassandra.distributed.upgrade.MixedModeAvailabilityTestBase.testAvailabilityCoordinatorUpgraded(MixedModeAvailabilityTestBase.java:74)
> Caused by: java.lang.AssertionError: Unexpected error while reading in case write-read consistency ALL-ONE with upgraded coordinator and 2 nodes down: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>     at org.apache.cassandra.distributed.upgrade.MixedModeAvailabilityTestBase.lambda$testAvailability$6(MixedModeAvailabilityTestBase.java:145)
>     at org.apache.cassandra.distributed.upgrade.UpgradeTestBase$TestCase.run(UpgradeTestBase.java:339)
> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>     at org.apache.cassandra.service.reads.ReadCallback.awaitResults(ReadCallback.java:162)
>     at org.apache.cassandra.service.reads.AbstractReadExecutor.awaitResponses(AbstractReadExecutor.java:387)
>     at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:2124)
>     at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1995)
>     at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1873)
>     at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:1286)
>     at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:364)
>     at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:293)
>     at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:105)
>     at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:122)
>     at org.apache.cassandra.distributed.impl.Coordinator.unsafeExecuteInternal(Coordinator.java:103)
>     at org.apache.cassandra.distributed.impl.Coordinator.lambda$executeWithResult$0(Coordinator.java:66)
>     at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
>     at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
> This has failed 143 times in 500 iterations of this CircleCI run:
> * https://app.circleci.com/pipelines/github/adelapena/cassandra/2927/workflows/fcd1cd60-826b-484a-8e81-d3ba640f7de9/jobs/47659/tests
> The failure has also recently appeared on Jenkins:
> * https://ci-cassandra.apache.org/job/Cassandra-trunk/1585/testReport/org.apache.cassandra.distributed.upgrade/MixedModeAvailabilityV3XAllOneTest/testAvailabilityCoordinatorUpgraded__jdk11/
> Given that the failure has just appeared on Jenkins and it fails relatively
> easily on CircleCI, it's likely that it has been broken by a very recent
> change.