[
https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938001#comment-17938001
]
Dmitry Konstantinov edited comment on CASSANDRA-20226 at 3/24/25 8:34 PM:
--------------------------------------------------------------------------
Some initial experiment results (looks promising).
h3. Configuration
Before applying the changes I adjusted the number of flushing threads from the
default 2 to 4, because above ~500-600k rows/sec the flushing logic becomes a
bottleneck (memtable allocation backpressure kicks in). I expect to get some
improvements for flushing itself in CASSANDRA-20173 and CASSANDRA-20465.
So, the full set of changed parameters:
{code:java}
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints # to improve async-profiler accuracy
-Dio.netty.eventLoopThreads=4 # we do not actually need as many as 2 * CPU cores
memtable_allocation_type: offheap_objects
memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
      parameters:
        shards: 32
    default:
      inherits: trie
commitlog_disk_access_mode: direct
native_transport_max_request_data_in_flight: 1024MiB
native_transport_max_request_data_in_flight_per_ip: 1024MiB
{code}
Compaction is enabled.
Test logic (1 text partition key column, 1 text clustering column, 5 text value
columns; inserts are done using 10-row batches):
{code:java}
./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node somenode
{code}
h3. Baseline
Baseline using the configuration above (recent 5.1/trunk build, commit: 10c8c042):
{code:java}
Results:
Op rate : 51,329 op/s [insert: 51,329 op/s]
Partition rate : 51,329 pk/s [insert: 51,329 pk/s]
Row rate : 513,289 row/s [insert: 513,289 row/s]
Latency mean : 1.9 ms [insert: 1.9 ms]
Latency median : 1.5 ms [insert: 1.5 ms]
Latency 95th percentile : 3.8 ms [insert: 3.8 ms]
Latency 99th percentile : 7.4 ms [insert: 7.4 ms]
Latency 99.9th percentile : 41.1 ms [insert: 41.1 ms]
Latency max : 114.8 ms [insert: 114.8 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:03:14
{code}
[^5.1_batch_baseline.html]
h3. Changes
It is a sketch as of now, not a final implementation. The changes are applied
one on top of another, in the order described (except the padding, which is not
applied).
* Existing allocation logic optimizations:
** To reduce contention I replaced the CAS loop in
MemtablePool.SubPool#tryAllocate with allocatedUpdater.addAndGet(this, size); a
simplified sketch of the change follows after the results.
{code:java}
Results:
Op rate : 56,040 op/s [insert: 56,356 op/s]
Partition rate : 56,040 pk/s [insert: 56,356 pk/s]
Row rate : 560,405 row/s [insert: 563,563 row/s]
Latency mean : 1.8 ms [insert: 1.8 ms]
Latency median : 1.4 ms [insert: 1.4 ms]
Latency 95th percentile : 3.4 ms [insert: 3.4 ms]
Latency 99th percentile : 7.5 ms [insert: 7.5 ms]
Latency 99.9th percentile : 41.3 ms [insert: 41.3 ms]
Latency max : 293.6 ms [insert: 293.6 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:58
{code}
[^5.1_batch_addAndGet.html]
[https://github.com/apache/cassandra/commit/842623a89042e6d55bc86cc225d348a0db3c5666]
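To illustrate the change, here is a minimal standalone sketch (the class and fields are simplified stand-ins, not the actual SubPool code): when the CAS loop exists only to publish allocated += size, a single fetch-and-add avoids retry storms under contention, and a limit check can be expressed by backing out on overshoot.
{code:java}
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Standalone sketch only; the real SubPool carries more state and checks.
final class SubPoolSketch
{
    private static final AtomicLongFieldUpdater<SubPoolSketch> allocatedUpdater =
        AtomicLongFieldUpdater.newUpdater(SubPoolSketch.class, "allocated");

    private volatile long allocated;
    private final long limit;

    SubPoolSketch(long limit)
    {
        this.limit = limit;
    }

    // before: classic CAS retry loop; every failed CAS under contention is a wasted round trip
    boolean tryAllocateCas(long size)
    {
        while (true)
        {
            long cur = allocated;
            if (cur + size > limit)
                return false;
            if (allocatedUpdater.compareAndSet(this, cur, cur + size))
                return true;
        }
    }

    // after: one unconditional fetch-and-add (a single lock xadd on x86), backed out on overshoot;
    // the transient overshoot visible to concurrent readers is the semantic difference to verify
    boolean tryAllocateAdd(long size)
    {
        if (allocatedUpdater.addAndGet(this, size) <= limit)
            return true;
        allocatedUpdater.addAndGet(this, -size);
        return false;
    }
}
{code}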
*
** To reduce contention I switched MemtableAllocator.SubAllocator#owns from
updates via AtomicLongFieldUpdater to LongAdder: the
MemtableAllocator.SubAllocator#acquired(..) method, used on a hot path, updates
the "owns" value but does not use the updated result. A sketch of the pattern
follows after the results.
{code:java}
Results:
Op rate : 90,220 op/s [insert: 90,220 op/s]
Partition rate : 90,220 pk/s [insert: 90,220 pk/s]
Row rate : 902,196 row/s [insert: 902,196 row/s]
Latency mean : 1.1 ms [insert: 1.1 ms]
Latency median : 0.9 ms [insert: 0.9 ms]
Latency 95th percentile : 1.7 ms [insert: 1.7 ms]
Latency 99th percentile : 3.1 ms [insert: 3.1 ms]
Latency 99.9th percentile : 39.6 ms [insert: 39.6 ms]
Latency max : 130.6 ms [insert: 130.6 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:01:50
{code}
[^5.1_batch_LongAdder.html]
[https://github.com/apache/cassandra/commit/1deef005705e64abfa17c7d3117854be00dc7189]
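The pattern in a nutshell (a simplified standalone sketch, not the actual SubAllocator code): because acquired(..) never consumes the updated total, the single contended counter word can be replaced by a LongAdder, which stripes writes across cells and defers aggregation to the (rare) reads.
{code:java}
import java.util.concurrent.atomic.LongAdder;

// Standalone sketch only; names are illustrative.
final class SubAllocatorSketch
{
    // before: a volatile long plus AtomicLongFieldUpdater, so every writer CASes the same word;
    // after: LongAdder stripes increments across cells, so concurrent writers rarely collide
    private final LongAdder owns = new LongAdder();

    // hot path: record ownership; the updated total is not needed here,
    // which is exactly the write-mostly pattern LongAdder is designed for
    void acquired(long size)
    {
        owns.add(size);
    }

    // reads sum the cells: more expensive and only weakly consistent under concurrent
    // updates, which is acceptable for accounting-style reads off the hot path
    long owns()
    {
        return owns.sum();
    }
}
{code}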
*
** An attempt to reduce possible false sharing by adding padding around the
MemtablePool.SubPool.allocated/reclaiming/nextClean fields as well as the
MemtableAllocator.SubAllocator.state/reclaiming fields, using a sub-class
approach like:
{code:java}
public static class SubPoolPadding0
{
    // why this padding style is used:
    // https://shipilev.net/jvm/objects-inside-out/#_observation_hierarchy_tower_padding_trick_collapse_in_jdk_15
    byte p0_00, p0_01, p0_02, p0_03, p0_04, p0_05, p0_06, p0_07, p0_08, p0_09, p0_10, p0_11, p0_12, p0_13, p0_14, p0_15;
    byte p0_16, p0_17, p0_18, p0_19, p0_20, p0_21, p0_22, p0_23, p0_24, p0_25, p0_26, p0_27, p0_28, p0_29, p0_30, p0_31;
    byte p0_32, p0_33, p0_34, p0_35, p0_36, p0_37, p0_38, p0_39, p0_40, p0_41, p0_42, p0_43, p0_44, p0_45, p0_46, p0_47;
    byte p0_48, p0_49, p0_50, p0_51, p0_52, p0_53, p0_54, p0_55, p0_56, p0_57, p0_58, p0_59, p0_60, p0_61, p0_62, p0_63;
}

public static class SubPoolPadding1 extends SubPoolPadding0
{
    // total bytes allocated and reclaiming
    volatile long allocated;
}

public static class SubPoolPadding2 extends SubPoolPadding1
{
    byte p1_00, p1_01, p1_02, p1_03, p1_04, p1_05, p1_06, p1_07, p1_08, p1_09, p1_10, p1_11, p1_12, p1_13, p1_14, p1_15;
    byte p1_16, p1_17, p1_18, p1_19, p1_20, p1_21, p1_22, p1_23, p1_24, p1_25, p1_26, p1_27, p1_28, p1_29, p1_30, p1_31;
    byte p1_32, p1_33, p1_34, p1_35, p1_36, p1_37, p1_38, p1_39, p1_40, p1_41, p1_42, p1_43, p1_44, p1_45, p1_46, p1_47;
    byte p1_48, p1_49, p1_50, p1_51, p1_52, p1_53, p1_54, p1_55, p1_56, p1_57, p1_58, p1_59, p1_60, p1_61, p1_62, p1_63;
}
{code}
It did not help :(, the results are even slightly worse.
{code:java}
Results:
Op rate : 85,575 op/s [insert: 85,575 op/s]
Partition rate : 85,575 pk/s [insert: 85,575 pk/s]
Row rate : 855,747 row/s [insert: 855,747 row/s]
Latency mean : 1.2 ms [insert: 1.2 ms]
Latency median : 0.9 ms [insert: 0.9 ms]
Latency 95th percentile : 1.9 ms [insert: 1.9 ms]
Latency 99th percentile : 3.5 ms [insert: 3.5 ms]
Latency 99.9th percentile : 39.0 ms [insert: 39.0 ms]
Latency max : 152.2 ms [insert: 152.2 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:01:56
{code}
[^5.1_batch_pad_allocated.html]
* Allocation batching
** A simple prototype for the first option mentioned by Benedict is added:
{quote}If we don't contend our update and no existing data is present, we can
accurately calculate the space we require.
{quote}
The current version covers only the offheap_objects case for now, and only
adding a new row into a memtable. An illustrative sketch of the idea follows
after the results.
{code:java}
Results:
Op rate : 111,119 op/s [insert: 111,119 op/s]
Partition rate : 111,119 pk/s [insert: 111,119 pk/s]
Row rate : 1,111,192 row/s [insert: 1,111,192 row/s]
Latency mean : 0.9 ms [insert: 0.9 ms]
Latency median : 0.7 ms [insert: 0.7 ms]
Latency 95th percentile : 1.2 ms [insert: 1.2 ms]
Latency 99th percentile : 2.1 ms [insert: 2.1 ms]
Latency 99.9th percentile : 38.8 ms [insert: 38.8 ms]
Latency max : 163.1 ms [insert: 163.1 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:01:29
{code}
[^5.1_batch_alloc_batching.html]
[https://github.com/apache/cassandra/commit/23efee95e07ef169e99827954bc2d7974af3f314]
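The core of the idea, as a standalone sketch (the names are illustrative and an AtomicLong stands in for the pool accounting; this is not the committed prototype): under the stated assumption of a new, uncontended row whose cell sizes can be computed exactly up front, the accounting can be charged once per row instead of once per cell.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Standalone sketch only; an AtomicLong stands in for the memtable pool accounting.
final class AllocationBatchingSketch
{
    private final AtomicLong allocated = new AtomicLong();

    // per-cell accounting: N contended atomic updates per row
    void allocatePerCell(long[] cellSizes)
    {
        for (long size : cellSizes)
            allocated.addAndGet(size);
    }

    // batched accounting: when the row is new and the sizes are exact,
    // a single atomic update covers the whole row
    void allocateBatched(long[] cellSizes)
    {
        long total = 0;
        for (long size : cellSizes)
            total += size;
        allocated.addAndGet(total);
    }
}
{code}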
> Reduce contention in NativeAllocator.allocate
> ---------------------------------------------
>
> Key: CASSANDRA-20226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Memtable
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html,
> 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html,
> 5.1_batch_pad_allocated.html, cpu_profile_batch.html,
> image-2025-01-20-23-38-58-896.png, profile.yaml
>
>
> For a high insert batch rate it looks like we have a bottleneck in
> NativeAllocator.allocate, probably caused by contention within the logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following 2 potential places to assess:
> # the allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a
> while loop with a CAS, which can be inefficient under high contention;
> similar to CASSANDRA-15922, we can try to replace it with addAndGet (need to
> check that it does not break the allocator logic)
> # the swap region logic in NativeAllocator.trySwapRegion (under a high insert
> rate, 1MiB regions can be swapped quite frequently)
> Reproducing test details:
> * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node somenode
> {code}
> * Cassandra version: 5.0.3
> * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>         shards: 32
>     default:
>       inherits: trie
> {code}
> * 1 node cluster
> * OpenJDK jdk-17.0.12+7
> * Linux kernel: 4.18.0-240.el8.x86_64
> * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> * RAM: 46GiB