[jira] [Commented] (CASSANDRA-19651) idealCLWriteLatency metric reports the worst response time instead of the time when ideal CL is satisfied

Dmitry Konstantinov (Jira) Wed, 21 Aug 2024 13:33:06 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875658#comment-17875658 ]


Dmitry Konstantinov commented on CASSANDRA-19651:
-------------------------------------------------

Thank you for your support!

Regarding the random exception in 
distributed.test.ring.BootstrapTest#bootstrapUnspecifiedResumeTest-_jdk17
{code:java}
Cannot invoke "org.apache.cassandra.gms.EndpointState.getApplicationState(org.apache.cassandra.gms.ApplicationState)" because "state" is null-java.lang.NullPointerException: Cannot invoke "org.apache.cassandra.gms.EndpointState.getApplicationState(org.apache.cassandra.gms.ApplicationState)" because "state" is null
    at org.apache.cassandra.distributed.action.GossipHelper$PullSchemaFrom.lambda$accept$6adea493$1(GossipHelper.java:245)
    at org.apache.cassandra.distributed.impl.IsolatedExecutor.lambda$async$10(IsolatedExecutor.java:156)
    at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
    at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
    at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)
{code}
I can create a separate ticket to make the test less flaky by adding a retry 
loop, like this:
{code:java}
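EndpointState state = waitForValue(() -> Gossiper.instance.getEndpointStateForEndpoint(endpoint), Objects::nonNull, 10_000);
...

// Polls valueProvider every 50 ms until condition accepts the value
// or at least timeoutMs has been spent sleeping.
// On timeout the last polled value is returned even if it does not satisfy condition.
private static <V> V waitForValue(Callable<V> valueProvider, Predicate<V> condition, int timeoutMs)
{
    V value;
    int sleptTimeMs = 0;
    int sleepPeriodMs = 50;
    try
    {
        do
        {
            value = valueProvider.call();
            if (condition.test(value) || sleptTimeMs >= timeoutMs)
            {
                return value;
            }
            sleptTimeMs += sleepPeriodMs;
            Thread.sleep(sleepPeriodMs);
        }
        while (true);
    }
    catch (Exception e)
    {
        // Callable.call() and Thread.sleep() throw checked exceptions; rethrow unchecked
        throw new RuntimeException(e);
    }
}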
{code}
[~aweisberg] what do you think, does it make sense?
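
For comparison, here is a hedged variation on the retry helper proposed above; it is not part of the proposal. It assumes two changes that the original does not make: the timeout is tracked with a System.nanoTime() deadline rather than by summing sleep periods, and a timeout raises an AssertionError instead of returning the last value, so a still-null endpoint state fails with a descriptive message rather than a later NullPointerException. The class name WaitForValueSketch and the demo provider in main are hypothetical.

```java
import java.util.Objects;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical, self-contained sketch; not code from the Cassandra repository.
final class WaitForValueSketch
{
    /**
     * Polls valueProvider until condition accepts the value.
     * Assumption: on timeout this throws AssertionError instead of returning the last value.
     */
    static <V> V waitForValue(Callable<V> valueProvider, Predicate<V> condition, long timeoutMs)
    {
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        long sleepPeriodMs = 50;
        try
        {
            while (true)
            {
                V value = valueProvider.call();
                if (condition.test(value))
                    return value;
                // Overflow-safe comparison of nanoTime values
                if (System.nanoTime() - deadlineNanos >= 0)
                    throw new AssertionError("Condition not met within " + timeoutMs + " ms, last value: " + value);
                Thread.sleep(sleepPeriodMs);
            }
        }
        catch (InterruptedException e)
        {
            // Restore the interrupt flag before rethrowing
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args)
    {
        AtomicInteger calls = new AtomicInteger();
        // The provider returns null twice and then a value, imitating state that appears late.
        String state = waitForValue(() -> calls.incrementAndGet() < 3 ? null : "NORMAL",
                                    Objects::nonNull, 10_000);
        System.out.println(state + " after " + calls.get() + " calls"); // prints "NORMAL after 3 calls"
    }
}
```

Whether to fail fast on timeout or to return the last value, as the proposed helper does, is a judgment call that depends on how the caller reports the failure.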

> idealCLWriteLatency metric reports the worst response time instead of the 
> time when ideal CL is satisfied
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Observability
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 4.0.14, 5.0.1, 5.1, 4.1.7
>
>         Attachments: 
> ci_summary-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.html, 
> ci_summary-cassandra-4.1-1ed312f881c0c170c8833ff9fbf397ab8fc625cc.html, 
> ci_summary-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.html, 
> ci_summary-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.html, 
> result_details-cassandra-4.0-a75f6c3e81f677e50c0a0d467dd5dad672f923e3.tar.gz, 
> result_details-cassandra-5.0-009f2982ac88d9c9bc0a7a7d29220f055aa7f11e.tar.gz, 
> result_details-trunk-da68729322515b4a7a698b73a0154ecdeb3abf39.tar.gz, 
> select-junit-tests-rerun-4.1.zip
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> org.apache.cassandra.service.AbstractWriteResponseHandler:
> {code:java}
> private final void decrementResponseOrExpired()
> {
>     int decrementedValue = responsesAndExpirations.decrementAndGet();
>     if (decrementedValue == 0)
>     {
>         // The condition being signaled is a valid proxy for the CL being achieved
>         // Only mark it as failed if the requested CL was achieved.
>         if (!condition.isSignalled() && requestedCLAchieved)
>         {
>             replicaPlan.keyspace().metric.writeFailedIdealCL.inc();
>         }
>         else
>         {
>             replicaPlan.keyspace().metric.idealCLWriteLatency.addNano(nanoTime() - queryStartNanoTime);
>         }
>     }
> }
> {code}
> Actual result: responsesAndExpirations is the total number of replicas across 
> all DCs and does not depend on the ideal CL, so 
> replicaPlan.keyspace().metric.idealCLWriteLatency is updated only when the 
> last response or timeout has arrived for all replicas.
> Expected result: replicaPlan.keyspace().metric.idealCLWriteLatency is updated 
> as soon as enough replicas have responded to satisfy the ideal CL.
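
The expected behaviour described above can be illustrated with a minimal, hypothetical sketch. It is not the actual patch and does not use Cassandra's classes: IdealClLatencySketch, its constructor parameters and the injected clock and recorder are all assumptions made for illustration. It counts down the number of acknowledgements the ideal CL requires, rather than the total number of replicas, and records the latency once, when that count reaches zero. It covers successful responses only; timeouts and failures are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical illustration only; names do not come from the Cassandra code base.
final class IdealClLatencySketch
{
    private final AtomicInteger remainingForIdealCl; // acks still needed to satisfy the ideal CL
    private final long queryStartNanos;
    private final LongSupplier nanoClock;            // injected so the demo can use a fake clock
    private final LongConsumer latencyRecorder;      // stands in for the latency metric

    IdealClLatencySketch(int requiredForIdealCl, long queryStartNanos,
                         LongSupplier nanoClock, LongConsumer latencyRecorder)
    {
        this.remainingForIdealCl = new AtomicInteger(requiredForIdealCl);
        this.queryStartNanos = queryStartNanos;
        this.nanoClock = nanoClock;
        this.latencyRecorder = latencyRecorder;
    }

    void onSuccessfulResponse()
    {
        // Only the response that brings the counter to exactly zero records the latency;
        // later responses push the counter below zero and are ignored.
        if (remainingForIdealCl.decrementAndGet() == 0)
            latencyRecorder.accept(nanoClock.getAsLong() - queryStartNanos);
    }

    public static void main(String[] args)
    {
        long[] now = { 0 };
        List<Long> recorded = new ArrayList<>();
        // Three replicas respond, but the ideal CL needs only two acknowledgements.
        IdealClLatencySketch tracker =
            new IdealClLatencySketch(2, 0L, () -> now[0], nanos -> recorded.add(nanos));

        now[0] = 5;   tracker.onSuccessfulResponse(); // first ack, nothing recorded yet
        now[0] = 12;  tracker.onSuccessfulResponse(); // second ack satisfies the ideal CL
        now[0] = 900; tracker.onSuccessfulResponse(); // slow third replica is ignored

        System.out.println(recorded); // prints [12]
    }
}
```

In this sketch the slow third replica (900 ns) does not affect the recorded value, whereas a counter sized to the total number of replicas would have reported 900.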



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
