[ https://issues.apache.org/jira/browse/CASSANDRA-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17913856#comment-17913856 ]

Dmitry Konstantinov commented on CASSANDRA-20132:
-------------------------------------------------

A concern came to my mind regarding the overhead of this new metric 
collection: we iterate over the cells of each row, so the effort multiplies: 
N rows x M cells. So I decided to measure it.
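
To illustrate where the extra work comes from, here is a minimal standalone sketch (not the actual Cassandra iterator code; the Cell/Row types and the gcBefore comparison are simplified assumptions) of cell-level counting nested inside the per-row loop:
{code:java}
import java.util.List;

// Simplified standalone sketch: for every row returned we also walk all of its
// cells to check whether each deleted cell is already purgeable, so the work is
// proportional to N rows x M cells.
public class PurgeableTombstoneCountSketch
{
    // Hypothetical minimal data model, only for this sketch.
    record Cell(boolean isTombstone, int localDeletionTime) {}
    record Row(List<Cell> cells) {}

    static long countPurgeableCellTombstones(List<Row> rows, int gcBefore)
    {
        long purgeable = 0;
        for (Row row : rows)                      // N rows ...
            for (Cell cell : row.cells())         // ... x M cells per row
                if (cell.isTombstone() && cell.localDeletionTime() < gcBefore)
                    purgeable++;                  // old enough to be purged
        return purgeable;
    }

    public static void main(String[] args)
    {
        // Usage example: one row with one live cell and one already purgeable tombstone.
        List<Row> rows = List.of(new Row(List.of(new Cell(false, 0), new Cell(true, 1000))));
        System.out.println(countPurgeableCellTombstones(rows, 2000)); // prints 1
    }
}
{code}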

I have executed the following cassandra-stress read test:
* CPU-bound (the worst case; if we are IO-bound the overhead is much less visible)
* 10-row single-partition read
* without tombstones
* single-node cluster

{code}
./tools/bin/cassandra-stress "user profile=./profile.yaml ops(partition-select=1) n=10m" -rate threads=100 -node somenode
{code}
The data was generated beforehand using:
{code}
./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node rtc2vm085cn
{code}
Test profile:  [^profile.yaml] 
I have done 3 runs and selected the median values for the results.

Without the new logic (baseline):
{code}
Results:
Op rate                   :   77,602 op/s  [partition-select: 77,602 op/s]
Partition rate            :   77,602 pk/s  [partition-select: 77,602 pk/s]
Row rate                  :  776,020 row/s [partition-select: 776,020 row/s]
Latency mean              :    1.3 ms [partition-select: 1.3 ms]
Latency median            :    1.1 ms [partition-select: 1.1 ms]
Latency 95th percentile   :    2.5 ms [partition-select: 2.5 ms]
Latency 99th percentile   :    3.7 ms [partition-select: 3.7 ms]
Latency 99.9th percentile :    9.6 ms [partition-select: 9.6 ms]
Latency max               :   79.1 ms [partition-select: 79.1 ms]
Total partitions          : 10,000,000 [partition-select: 10,000,000]
Total errors              :          0 [partition-select: 0]
Total GC count            : 82
Total GC memory           : 391.943 GiB
Total GC time             :    1.8 seconds
Avg GC time               :   22.4 ms
StdDev GC time            :    2.3 ms
Total operation time      : 00:02:08
{code}

With the new logic, without cell tombstone counting:
{code}
Results:
Op rate                   :   76,848 op/s  [partition-select: 76,848 op/s]
Partition rate            :   76,848 pk/s  [partition-select: 76,848 pk/s]
Row rate                  :  768,484 row/s [partition-select: 768,484 row/s]
Latency mean              :    1.3 ms [partition-select: 1.3 ms]
Latency median            :    1.1 ms [partition-select: 1.1 ms]
Latency 95th percentile   :    2.6 ms [partition-select: 2.6 ms]
Latency 99th percentile   :    3.8 ms [partition-select: 3.8 ms]
Latency 99.9th percentile :    8.8 ms [partition-select: 8.8 ms]
Latency max               :   33.5 ms [partition-select: 33.5 ms]
Total partitions          : 10,000,000 [partition-select: 10,000,000]
Total errors              :          0 [partition-select: 0]
Total GC count            : 83
Total GC memory           : 396.704 GiB
Total GC time             :    1.8 seconds
Avg GC time               :   22.1 ms
StdDev GC time            :    2.2 ms
Total operation time      : 00:02:10
{code}

With the new logic, with cell tombstone counting:
{code}
Results:
Op rate                   :   73,038 op/s  [partition-select: 73,038 op/s]
Partition rate            :   73,038 pk/s  [partition-select: 73,038 pk/s]
Row rate                  :  730,383 row/s [partition-select: 730,383 row/s]
Latency mean              :    1.4 ms [partition-select: 1.4 ms]
Latency median            :    1.2 ms [partition-select: 1.2 ms]
Latency 95th percentile   :    2.7 ms [partition-select: 2.7 ms]
Latency 99th percentile   :    4.0 ms [partition-select: 4.0 ms]
Latency 99.9th percentile :    9.9 ms [partition-select: 9.9 ms]
Latency max               :   35.2 ms [partition-select: 35.2 ms]
Total partitions          : 10,000,000 [partition-select: 10,000,000]
Total errors              :          0 [partition-select: 0]
Total GC count            : 85
Total GC memory           : 406.237 GiB
Total GC time             :    1.9 seconds
Avg GC time               :   22.0 ms
StdDev GC time            :    2.0 ms
Total operation time      : 00:02:16
{code}

CPU async profiler flamegraph:  [^cpu_profile_select_cell.html] 

So, there is a visible, though not huge, overhead for counting potential cell 
tombstones: in this CPU-bound worst case the op rate drops from ~77,600 to 
~73,000 op/s (about 6%), while the new logic without cell-level counting costs 
only about 1%. In the majority of cases purgeable cell tombstones are not the 
ones causing issues, purgeable row tombstones are, so it makes sense to stop at 
row-level granularity by default and keep it flexible: I am planning to 
introduce a configuration option: purgeable_tombstones_metric_granularity: 
disabled | row (default) | cell
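
As a rough illustration of what such an option could look like (a minimal sketch; the enum name and helper methods below are my assumptions, not a committed API):
{code:java}
// Hypothetical sketch of the proposed purgeable_tombstones_metric_granularity option.
public enum PurgeableTombstonesMetricGranularity
{
    DISABLED, // no purgeable-tombstone counting at all
    ROW,      // count purgeable row tombstones only (proposed default)
    CELL;     // additionally walk the cells, accepting the per-cell overhead measured above

    public boolean countsRows()
    {
        return this != DISABLED;
    }

    public boolean countsCells()
    {
        return this == CELL;
    }
}
{code}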


> Add metric and tracing event for scanned purgeable tombstones
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-20132
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20132
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Other
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.0.x
>
>         Attachments: cpu_profile_select_cell.html, profile.yaml, 
> trace_sample.txt
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Currently, Cassandra can read data from SSTables with tombstones and later 
> drop them silently if the tombstones are older than gc_grace_seconds (aka 
> purgeable tombstones). Such tombstones are not visible via the 
> readTombstoneHistogram metric, are not reported in Cassandra logs when the 
> tombstone threshold is crossed, and are not mentioned in tracing events. As a 
> result, if a partition has a lot of purgeable tombstones we may have a slow 
> read query without any signs of why it is slow. Example: [^trace_sample.txt]
> This suggested improvement adds:
> 1) a new metric which tracks number of such tombstones: 
> PurgeableTombstoneScannedHistogram
> 2) a new tracing event: "Read {} purgeable tombstone cells", emitted if the 
> number of such tombstones is > 0
>  
> Implementation notes: the logic of the new withMetricsRecording iterator is an 
> adjusted version of the existing withMetricsRecording iterator


