[
https://issues.apache.org/jira/browse/CASSANDRA-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926966#comment-17926966
]
Dmitry Konstantinov edited comment on CASSANDRA-20250 at 2/13/25 9:59 PM:
--------------------------------------------------------------------------
Updates:
* adjust Timer instances to use new Meter implementation - DONE
* move the ticks out of mark logic to a background thread - DONE
* try to move the averages to a separate non-thread-local array to improve
fetching/caching during the bulk update - in progress
* add logic to release metric ids when metrics are unregistered from the
registry - partially done; still need to think through safety in some
concurrency scenarios
* code cleanup: remove LazySetArrayThreadLocalMetrics, rename
PiggybackArrayThreadLocalMetrics to ThreadLocalMetrics - DONE
* run an e2e stress test for the current logic in the branch - DONE
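To illustrate the "move the ticks out of mark logic" item: a minimal sketch, assuming an EWMA-style rate in which the hot path only counts events and a background thread does the periodic decay. The class name, the 5-second interval and the method names are assumptions for illustration; this is not the Meter implementation from the branch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch; not the Meter implementation from the branch.
final class BackgroundTickMeterSketch {
    static final long TICK_SECONDS = 5;            // assumed tick interval
    private final double alpha;                    // EWMA smoothing factor
    private final LongAdder uncounted = new LongAdder();
    private volatile double ratePerSecond;
    private volatile boolean initialized;

    BackgroundTickMeterSketch(double windowSeconds) {
        this.alpha = 1 - Math.exp(-TICK_SECONDS / windowSeconds);
    }

    // Hot path: only records the event, never reads the clock.
    void mark() {
        uncounted.increment();
    }

    // Expected to be called by one background thread every TICK_SECONDS.
    void tick() {
        double instantRate = uncounted.sumThenReset() / (double) TICK_SECONDS;
        if (initialized) {
            ratePerSecond += alpha * (instantRate - ratePerSecond);
        } else {
            ratePerSecond = instantRate;
            initialized = true;
        }
    }

    double ratePerSecond() {
        return ratePerSecond;
    }

    // Schedules tick() on a daemon thread; the caller shuts the executor down.
    ScheduledExecutorService startTicking() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "meter-tick");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleAtFixedRate(this::tick, TICK_SECONDS, TICK_SECONDS, TimeUnit.SECONDS);
        return ses;
    }
}
```

The trade-off in this sketch is that the rate is refreshed only when the background thread runs, so a reader may see a value that is up to one tick old.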
Additionally, I have switched the codahale Histogram to use the thread-local
metrics too; it also used an AtomicLong for its count.
!Histogram_AtomicLong.png|width=500!
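For readers unfamiliar with the idea, here is a minimal sketch of a thread-local counter that could stand in for a shared AtomicLong count. It is only an illustration under assumptions: the class and method names are hypothetical, and the ThreadLocalMetrics class in the branch is array-based and differs from this.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration; not the ThreadLocalMetrics class from the branch.
final class ThreadLocalCounterSketch {
    // One cell per thread; only the owning thread writes to it.
    private static final class Cell {
        volatile long value;
    }

    private final ConcurrentLinkedQueue<Cell> cells = new ConcurrentLinkedQueue<>();

    private final ThreadLocal<Cell> local = ThreadLocal.withInitial(() -> {
        Cell cell = new Cell();
        cells.add(cell);   // register the cell so readers can find it
        return cell;
    });

    // Hot path: a plain read-modify-write on the caller's own cell, no CAS.
    void increment() {
        Cell cell = local.get();
        cell.value = cell.value + 1;
    }

    // Read path: sums all per-thread cells; may miss in-flight increments.
    long sum() {
        long total = 0;
        for (Cell cell : cells) {
            total += cell.value;
        }
        return total;
    }
}
```

Note that this sketch never releases the cells of finished threads; a real implementation would need some release logic for that.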
Write stress test results:
{code:java}
Results:
Op rate : 163,892 op/s [WRITE: 163,892 op/s]
Partition rate : 163,892 pk/s [WRITE: 163,892 pk/s]
Row rate : 163,892 row/s [WRITE: 163,892 row/s]
Latency mean : 0.6 ms [WRITE: 0.6 ms]
Latency median : 0.5 ms [WRITE: 0.5 ms]
Latency 95th percentile : 0.9 ms [WRITE: 0.9 ms]
Latency 99th percentile : 1.3 ms [WRITE: 1.3 ms]
Latency 99.9th percentile : 7.7 ms [WRITE: 7.7 ms]
Latency max : 111.1 ms [WRITE: 111.1 ms]
Total partitions : 10,000,000 [WRITE: 10,000,000]
Total errors : 0 [WRITE: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:01:01
{code}
The current flamegraph: [^5.1_tl4_profile_cpu.html]
The "metrics" weight is now 4.69% (it was 8.65%), of which
"ThreadLocalMetrics" is 0.74% and "Reservoir" is 2.87%.
Regarding e2e throughput, I suspect that the CASSANDRA-20226 and
CASSANDRA-20310 bottlenecks are starting to play a major role.
Regarding the Reservoir logic, I am checking whether we can squeeze a bit more
out of the current version.
> Provide the ability to disable specific metrics collection
> ----------------------------------------------------------
>
> Key: CASSANDRA-20250
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20250
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Observability/Metrics
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Attachments: 5.1_profile_cpu.html,
> 5.1_profile_cpu_without_metrics.html, 5.1_tl4_profile_cpu.html,
> Histogram_AtomicLong.png, async_profiler_cpu_profiles.zip,
> cpu_profile_insert.html, jmh-result.json, vmstat.log,
> vmstat_without_metrics.log
>
>
> Cassandra collects a lot of metrics, and many of them are collected per
> table, so the number of instances is multiplied by the number of tables. On
> the one hand this gives better observability; on the other hand metrics are
> not free, and there is overhead associated with them:
> 1) CPU overhead: for a simple CPU-bound load, CPU flamegraphs already show
> about 5.5% of total CPU spent on metrics for a read load and 11% for a
> write load.
> Example: [^cpu_profile_insert.html] (search for the "codahale" pattern). The
> flamegraph was captured using the Async profiler build
> async-profiler-3.0-29ee888-linux-x64
> 2) Memory overhead: memory is spent on the entities used to aggregate
> metrics, such as LongAdders and reservoirs, and on MBeans (String
> concatenation within object names is a major cause: a new String is created
> for each table+metric name combination).
>
> The idea of this ticket is to allow an operator to configure a list of
> disabled metrics in cassandra.yaml, like:
> {code:yaml}
> disabled_metrics:
> - metric_a
> - metric_b
> {code}
> From an implementation point of view I see two possible approaches (which
> can be combined):
> # Generic: when a metric is being registered, if it is listed in
> disabled_metrics we do not publish it via JMX and we provide a no-op
> implementation of the metric object (such as a histogram) for it.
> Logging analogy: the log level check within a log method
> # Specialized: for some metrics the value calculation itself is not free
> and introduces overhead as well; in such cases it would be useful to check
> within the specific logic, using an API (like isMetricEnabled), whether we
> need to do it. Example of such a metric:
> ClientRequestSizeMetrics.recordRowAndColumnCountMetrics
> Logging analogy: an explicit 'if (isDebugEnabled())' condition used when a
> message parameter is expensive.
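The two approaches from the description can be sketched roughly as follows. This is an illustration under assumptions, not Cassandra or Dropwizard code: Recorder, NOOP, register and recordExpensive are hypothetical names, and only isMetricEnabled follows the API name suggested in the ticket.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

// Hypothetical names throughout; not Cassandra or Dropwizard API.
final class DisabledMetricsSketch {
    interface Recorder {
        void update(long value);
        long count();
    }

    // Shared no-op instance handed out for every disabled metric.
    static final Recorder NOOP = new Recorder() {
        @Override public void update(long value) { }
        @Override public long count() { return 0; }
    };

    // Stand-in for a real metric object such as a histogram.
    static final class CountingRecorder implements Recorder {
        private final LongAdder count = new LongAdder();
        @Override public void update(long value) { count.increment(); }
        @Override public long count() { return count.sum(); }
    }

    private final Set<String> disabled;

    DisabledMetricsSketch(Set<String> disabledMetrics) {
        this.disabled = new HashSet<>(disabledMetrics);
    }

    boolean isMetricEnabled(String name) {
        return !disabled.contains(name);
    }

    // Generic approach: the check happens once, at registration time.
    Recorder register(String name) {
        return isMetricEnabled(name) ? new CountingRecorder() : NOOP;
    }

    // Specialized approach: skip an expensive value calculation entirely.
    void recordExpensive(String name, Recorder recorder, LongSupplier expensiveValue) {
        if (isMetricEnabled(name)) {
            recorder.update(expensiveValue.getAsLong());
        }
    }
}
```

In the generic case the caller keeps invoking update() unconditionally and pays only for an empty method call; in the specialized case the guard also avoids computing the value.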