[ https://issues.apache.org/jira/browse/CASSANDRA-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ariel Weisberg updated CASSANDRA-19332:
---------------------------------------
    Attachment: ci_summary-cassandra-4.0-86ccfe7593c4ae238fa661d5fd1fcb6ec09de60c.html
                ci_summary-cassandra-4.1-6a76f36c4eeafe8c81c25aa71c63c19177b31d7d.html
                ci_summary-cassandra-5.0-f37f66f18c541573ae7d16f1e94ca625601785c6.html
                ci_summary-trunk-3b95441221a419f06242d0193e06cb4c99861253.html

> Dropwizard Meter causes timeouts when infrequently used
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19332
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19332
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments:
> ci_summary-cassandra-4.0-86ccfe7593c4ae238fa661d5fd1fcb6ec09de60c.html,
> ci_summary-cassandra-4.1-6a76f36c4eeafe8c81c25aa71c63c19177b31d7d.html,
> ci_summary-cassandra-5.0-f37f66f18c541573ae7d16f1e94ca625601785c6.html,
> ci_summary-trunk-3b95441221a419f06242d0193e06cb4c99861253.html
>
>
> Observed timeouts on clusters with long uptime and infrequently used
> tables, and possibly just infrequently used request paths, such as CAS
> going unused for large fractions of a year.
> CAS seems to be more severely impacted because it has more metrics in the
> request path, such as latency measurements for prepare, propose, and the
> read from the underlying table.
> Tracing showed ~600-800 milliseconds for these operations in between the
> “appending to memtable” and “sending a response” events. Reads had a delay
> between finishing the construction of the iterator and sending the read
> response.
> Stack traces dumped every 100 milliseconds using {{sjk}} show that in
> prepare and propose a lot of time was being spent in
> {{Meter.tickIfNecessary}}.
> {code:java}
> Thread [2537] RUNNABLE at 2024-01-25T21:14:48.218 - MutationStage-2
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:71)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.prepare(PaxosState.java:92)
> Thread [2539] RUNNABLE at 2024-01-25T21:14:48.520 - MutationStage-4
> com.codahale.metrics.Meter.tickIfNecessary(Meter.java:72)
> com.codahale.metrics.Meter.mark(Meter.java:55)
> com.codahale.metrics.Meter.mark(Meter.java:46)
> com.codahale.metrics.Timer.update(Timer.java:150)
> com.codahale.metrics.Timer.update(Timer.java:86)
> org.apache.cassandra.metrics.LatencyMetrics.addNano(LatencyMetrics.java:159)
> org.apache.cassandra.service.paxos.PaxosState.propose(PaxosState.java:127)
> {code}
> {{tickIfNecessary}} does an amount of work linearly proportional to the
> time since the metric was last updated/read/created, and this can take a
> measurable amount of time even in a tight loop. On my M2 MBP it was
> 1.5 milliseconds for a day, and ~200 days took ~74 milliseconds. Before it
> warmed up it was 140 milliseconds.
> A quick fix is to schedule a task to read all the meters once a day, so
> the catch-up isn’t done in the request path and there is only an
> incremental amount to process at a time.
> Also observed that {{tickIfNecessary}} is not 100% thread safe: if the
> loop takes longer than 5 seconds to run, multiple threads can end up
> running it at once, and they will then concurrently run {{EWMA.tick}},
> which probably results in some ticks not being performed.
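The catch-up cost and the once-a-day quick fix described above can be modeled with a rough sketch. This is a simplified stand-in and not the Dropwizard source: the class `IdleMeterSketch`, the helper `scheduleDailyTouch`, and the single decaying field that stands in for the EWMA state are all hypothetical. The 5-second tick interval comes from the description above. Time is passed in explicitly so the tick counts are deterministic.

```java
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Simplified stand-in for a meter's catch-up loop; not the Dropwizard source. */
final class IdleMeterSketch {
    static final long TICK_INTERVAL_NANOS = TimeUnit.SECONDS.toNanos(5);
    // Per-tick decay of a 1-minute moving average that receives no events.
    private static final double M1_DECAY = Math.exp(-5.0 / 60.0);

    private final AtomicLong lastTick;
    private double m1Rate = 1.0; // placeholder for the real EWMA state

    IdleMeterSketch(long nowNanos) {
        this.lastTick = new AtomicLong(nowNanos);
    }

    /** Runs one decay step per elapsed 5-second interval; returns the step count. */
    long tickIfNecessary(long nowNanos) {
        long oldTick = lastTick.get();
        long age = nowNanos - oldTick;
        if (age <= TICK_INTERVAL_NANOS) {
            return 0;
        }
        long newTick = nowNanos - age % TICK_INTERVAL_NANOS;
        // lastTick is published before the loop runs, so if the loop takes longer
        // than one interval, another thread can pass the age check and start
        // its own loop concurrently.
        if (!lastTick.compareAndSet(oldTick, newTick)) {
            return 0;
        }
        long requiredTicks = age / TICK_INTERVAL_NANOS;
        for (long i = 0; i < requiredTicks; i++) {
            m1Rate *= M1_DECAY; // total cost is linear in the idle time
        }
        return requiredTicks;
    }

    /** Hypothetical helper: touch every meter once a day, off the request path. */
    static ScheduledFuture<?> scheduleDailyTouch(ScheduledExecutorService executor,
                                                 List<IdleMeterSketch> meters) {
        return executor.scheduleWithFixedDelay(
            () -> meters.forEach(m -> m.tickIfNecessary(System.nanoTime())),
            1, 1, TimeUnit.DAYS);
    }

    public static void main(String[] args) {
        // Meter left idle for 200 days: a single call pays for the whole backlog.
        IdleMeterSketch idle = new IdleMeterSketch(0L);
        System.out.println(idle.tickIfNecessary(TimeUnit.DAYS.toNanos(200))); // 3456000

        // Meter touched daily: each touch pays for one day of ticks only.
        IdleMeterSketch touched = new IdleMeterSketch(0L);
        long worst = 0;
        for (int day = 1; day <= 200; day++) {
            worst = Math.max(worst, touched.tickIfNecessary(TimeUnit.DAYS.toNanos(day)));
        }
        System.out.println(worst); // 17280
    }
}
```

In this model, 200 idle days amount to 3,456,000 loop iterations on the first request, while a daily touch caps each run at 17,280. With real meters, the scheduled task would presumably read a rate from each registered meter rather than call a method like the one sketched here.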
> This issue is still present in the latest version of {{Metrics}} when
> using {{EWMA}}, but {{SlidingWindowTimeAverages}} looks like it needs only
> a bounded amount of work to tick. Switching would change how our metrics
> work, since the two don't have the same behavior.