[jira] [Commented] (CASSANDRA-19651) idealCLWriteLatency metric reports the worst response time instead of the time when ideal CL is satisfied

Dmitry Konstantinov (Jira) Sat, 17 Aug 2024 09:31:41 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874530#comment-17874530 ]


Dmitry Konstantinov commented on CASSANDRA-19651:
-------------------------------------------------

5.0 - 3 test failures
 * distributed.test.ring.BootstrapTest#bootstrapUnspecifiedResumeTest-_jdk17 - flaky, local re-run is OK
{code:bash}
% ant test-jvm-dtest-some \
    -Dtest.name=org.apache.cassandra.distributed.test.ring.BootstrapTest \
    -Dtest.methods=bootstrapUnspecifiedResumeTest
...
BUILD SUCCESSFUL
Total time: 1 minute 52 seconds
{code}
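The originally attached report contains the following exception. It looks like a timing issue: the EndpointState that GossipHelper$PullSchemaFrom reads is still null at that moment. We could add a bounded retry loop in GossipHelper that waits for a non-null EndpointState (and the ApplicationState it carries) before reading it: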
{code:java}
Cannot invoke "org.apache.cassandra.gms.EndpointState.getApplicationState(org.apache.cassandra.gms.ApplicationState)" because "state" is null-java.lang.NullPointerException: Cannot invoke "org.apache.cassandra.gms.EndpointState.getApplicationState(org.apache.cassandra.gms.ApplicationState)" because "state" is null
	at org.apache.cassandra.distributed.action.GossipHelper$PullSchemaFrom.lambda$accept$6adea493$1(GossipHelper.java:245)
	at org.apache.cassandra.distributed.impl.IsolatedExecutor.lambda$async$10(IsolatedExecutor.java:156)
	at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)
{code}
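A bounded wait of that kind could look roughly like the sketch below. It is a standalone illustration, not code from GossipHelper: the class and method names (BoundedWait, awaitNonNull) and the attempt/interval parameters are hypothetical, and in the dtest the supplier would wrap whatever gossip lookup currently yields the null EndpointState.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public final class BoundedWait
{
    private BoundedWait() {}

    /**
     * Hypothetical helper: polls the supplier until it returns a non-null value,
     * making at most maxAttempts polls and sleeping intervalMillis between them.
     * Throws IllegalStateException if the value is still null after the last poll.
     */
    public static <T> T awaitNonNull(Supplier<T> supplier, int maxAttempts, long intervalMillis, String description)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            T value = supplier.get();
            if (value != null)
                return value;
            if (attempt < maxAttempts)
            {
                try
                {
                    TimeUnit.MILLISECONDS.sleep(intervalMillis);
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    throw new IllegalStateException("Interrupted while waiting for " + description, e);
                }
            }
        }
        throw new IllegalStateException(description + " is still null after " + maxAttempts + " attempts");
    }

    public static void main(String[] args)
    {
        // Simulated gossip lookup that only returns a state on the third poll.
        int[] polls = {0};
        String state = awaitNonNull(() -> ++polls[0] >= 3 ? "NORMAL" : null, 10, 10, "endpoint state");
        System.out.println(state + " after " + polls[0] + " polls");
    }
}
{code}
Running main prints "NORMAL after 3 polls". Throwing after the last attempt, instead of returning null, keeps a genuinely missing state a clear test failure rather than turning it into another NPE further down.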

 * bootstrap_test.TestBootstrap#test_shutdown_wiped_node_cannot_join - flaky, local re-run is OK
{code:bash}
% pytest --cassandra-dir=/Users/dmitry/IdeaProjects/cassandra-5.0 \
    bootstrap_test.py::TestBootstrap::test_shutdown_wiped_node_cannot_join
...
test_shutdown_wiped_node_cannot_join passed 1 out of the required 1 times. Success!

===End Flaky Test Report===
============================================================= 1 passed, 192 warnings in 166.48s (0:02:46) =============================================================
{code}

 * cqlsh_tests.test_cqlsh_copy.TestCqlshCopy#test_round_trip_with_rate_file - flaky, CASSANDRA-17322, local re-run is OK
{code:bash}
% pytest --cassandra-dir=/Users/dmitry/IdeaProjects/cassandra-5.0 \
    cqlsh_tests/test_cqlsh_copy.py::TestCqlshCopy::test_round_trip_with_rate_file
...
test_round_trip_with_rate_file passed 1 out of the required 1 times. Success!

===End Flaky Test Report===
=================================================================== 1 passed, 64 warnings in 54.17s ===================================================================
{code}

> idealCLWriteLatency metric reports the worst response time instead of the time when ideal CL is satisfied
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Observability
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 4.1.x, 5.0.x, 5.x
>
>         Attachments: 
> ci_summary-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.html, 
> ci_summary-cassandra-4.1-1ed312f881c0c170c8833ff9fbf397ab8fc625cc.html, 
> ci_summary-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.html, 
> ci_summary-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.html, 
> result_details-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.tar.gz, 
> result_details-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.tar.gz, 
> result_details-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.tar.gz, 
> select-junit-tests-rerun-4.1.zip
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> org.apache.cassandra.service.AbstractWriteResponseHandler:
> {code:java}
> private final void decrementResponseOrExpired()
> {
>     int decrementedValue = responsesAndExpirations.decrementAndGet();
>     if (decrementedValue == 0)
>     {
>         // The condition being signaled is a valid proxy for the CL being achieved
>         // Only mark it as failed if the requested CL was achieved.
>         if (!condition.isSignalled() && requestedCLAchieved)
>         {
>             replicaPlan.keyspace().metric.writeFailedIdealCL.inc();
>         }
>         else
>         {
>             replicaPlan.keyspace().metric.idealCLWriteLatency.addNano(nanoTime() - queryStartNanoTime);
>         }
>     }
> }
> {code}
> Actual result: responsesAndExpirations is the total number of replicas across all DCs and does not depend on the ideal CL, so replicaPlan.keyspace().metric.idealCLWriteLatency is updated only when the last response (or timeout) has been received for every replica. The metric therefore reports the slowest replica's response time.
> Expected result: replicaPlan.keyspace().metric.idealCLWriteLatency is updated as soon as enough replicas have responded to satisfy the ideal CL.
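One way to get the expected behavior is to count ideal-CL acknowledgements separately from responsesAndExpirations. The sketch below is a simplified single-class model under stated assumptions, not the actual AbstractWriteResponseHandler change: IdealCLLatencyTracker and its methods are hypothetical, the ideal CL is assumed to be expressible as a single ack count (EACH_QUORUM's per-DC requirement is not modelled), timestamps are passed in explicitly instead of calling nanoTime(), and the requestedCLAchieved check is omitted.
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical, simplified model (not Cassandra's AbstractWriteResponseHandler):
 * ideal-CL latency is recorded when the ack that satisfies the ideal CL arrives,
 * not when the last replica responds or expires.
 */
public final class IdealCLLatencyTracker
{
    private final long queryStartNanos;
    private final AtomicInteger acksUntilIdealCL;        // successful acks still needed for the ideal CL
    private final AtomicInteger responsesAndExpirations; // replicas that have neither responded nor expired
    private final AtomicLong recordedLatencyNanos = new AtomicLong(-1); // -1 means "ideal CL not reached"
    private final AtomicInteger failureCount = new AtomicInteger();

    public IdealCLLatencyTracker(long queryStartNanos, int idealBlockFor, int totalReplicas)
    {
        this.queryStartNanos = queryStartNanos;
        this.acksUntilIdealCL = new AtomicInteger(idealBlockFor);
        this.responsesAndExpirations = new AtomicInteger(totalReplicas);
    }

    /** A replica acknowledged the write at nowNanos. */
    public void onSuccess(long nowNanos)
    {
        // Only the caller that moves the counter from 1 to 0 records the latency.
        if (acksUntilIdealCL.decrementAndGet() == 0)
            recordedLatencyNanos.set(nowNanos - queryStartNanos);
        decrementResponseOrExpired();
    }

    /** A replica failed or its response expired. */
    public void onFailureOrExpiration()
    {
        decrementResponseOrExpired();
    }

    private void decrementResponseOrExpired()
    {
        // Once every replica has responded or expired, count a failure if the ideal CL was never met.
        if (responsesAndExpirations.decrementAndGet() == 0 && acksUntilIdealCL.get() > 0)
            failureCount.incrementAndGet();
    }

    public long idealCLLatencyNanos() { return recordedLatencyNanos.get(); }

    public int failedIdealCL() { return failureCount.get(); }

    public static void main(String[] args)
    {
        long ms = 1_000_000L;
        // 3 replicas, ideal CL needs 2 acks (e.g. QUORUM with RF=3); the query starts at t=0.
        IdealCLLatencyTracker tracker = new IdealCLLatencyTracker(0, 2, 3);
        tracker.onSuccess(5 * ms);
        tracker.onSuccess(7 * ms);   // ideal CL satisfied here
        tracker.onSuccess(100 * ms); // slow replica: does not change the recorded latency
        System.out.println(tracker.idealCLLatencyNanos() / ms + " ms, failures=" + tracker.failedIdealCL());
    }
}
{code}
Running main prints "7 ms, failures=0": the latency comes from the second acknowledgement, whereas the current code would record it only when the slowest replica answers at 100 ms.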



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
