[ https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938001#comment-17938001 ]
Dmitry Konstantinov edited comment on CASSANDRA-20226 at 3/24/25 8:40 PM:
--------------------------------------------------------------------------

Some initial experiment results (they look promising: 513,289 row/s => 1,111,192 row/s).

h3. Configuration

Before applying the changes I adjusted the number of flushing threads from the default 2 to 4, because above ~500-600k rows/sec the flushing logic becomes a bottleneck (memtable allocation backpressure becomes active). I expect some improvements for flushing itself in CASSANDRA-20173 and CASSANDRA-20465. So, the total set of changed parameters:
{code:java}
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints # to improve accuracy of async-profiler
-Dio.netty.eventLoopThreads=4 # we do not need as many as the default (2 * CPU cores)

memtable_allocation_type: offheap_objects
memtable:
    configurations:
        skiplist:
            class_name: SkipListMemtable
        trie:
            class_name: TrieMemtable
            parameters:
                shards: 32
        default:
            inherits: trie
commitlog_disk_access_mode: direct
native_transport_max_request_data_in_flight: 1024MiB
native_transport_max_request_data_in_flight_per_ip: 1024MiB
{code}
Compaction is enabled.

Test logic (1 partition text column, 1 clustering text column, 5 value text columns; inserts are done using 10-row batches):
{code:java}
./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node somenode
{code}

h3. Baseline

Baseline using the configuration above (recent 5.1/trunk build, commit: 10c8c042):
{code:java}
Results:
Op rate                   : 51,329 op/s  [insert: 51,329 op/s]
Partition rate            : 51,329 pk/s  [insert: 51,329 pk/s]
Row rate                  : 513,289 row/s [insert: 513,289 row/s]
Latency mean              : 1.9 ms [insert: 1.9 ms]
Latency median            : 1.5 ms [insert: 1.5 ms]
Latency 95th percentile   : 3.8 ms [insert: 3.8 ms]
Latency 99th percentile   : 7.4 ms [insert: 7.4 ms]
Latency 99.9th percentile : 41.1 ms [insert: 41.1 ms]
Latency max               : 114.8 ms [insert: 114.8 ms]
Total partitions          : 10,000,000 [insert: 10,000,000]
Total errors              : 0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0 B
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:03:14
{code}
[^5.1_batch_baseline.html]

h3. Changes

This is a sketch as of now, not a final implementation. The changes are applied one on top of another, in the described order (except the padding, which is not applied).
* Existing allocation logic optimizations:
** To reduce contention I replaced the CAS loop in MemtablePool.SubPool#tryAllocate with allocatedUpdater.addAndGet(this, size);
{code:java}
Results:
Op rate                   : 56,040 op/s  [insert: 56,356 op/s]
Partition rate            : 56,040 pk/s  [insert: 56,356 pk/s]
Row rate                  : 560,405 row/s [insert: 563,563 row/s]
Latency mean              : 1.8 ms [insert: 1.8 ms]
Latency median            : 1.4 ms [insert: 1.4 ms]
Latency 95th percentile   : 3.4 ms [insert: 3.4 ms]
Latency 99th percentile   : 7.5 ms [insert: 7.5 ms]
Latency 99.9th percentile : 41.3 ms [insert: 41.3 ms]
Latency max               : 293.6 ms [insert: 293.6 ms]
Total partitions          : 10,000,000 [insert: 10,000,000]
Total errors              : 0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0 B
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:02:58
{code}
[^5.1_batch_addAndGet.html]
[https://github.com/apache/cassandra/commit/842623a89042e6d55bc86cc225d348a0db3c5666]
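For illustration, a minimal sketch of the pattern (the class, field and method names below are simplified placeholders, not the actual MemtablePool.SubPool code): the CAS retry loop that enforces a limit is replaced by a single unconditional addAndGet, which removes retries on the contended counter but allows the counter to transiently overshoot the limit, so the surrounding allocator logic has to tolerate or roll back the overshoot.
{code:java}
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

public class AllocationCounterSketch
{
    private static final AtomicLongFieldUpdater<AllocationCounterSketch> allocatedUpdater =
        AtomicLongFieldUpdater.newUpdater(AllocationCounterSketch.class, "allocated");

    private final long limit;
    private volatile long allocated;

    public AllocationCounterSketch(long limit)
    {
        this.limit = limit;
    }

    // original style: CAS retry loop; every failed CAS is another round trip on a contended cache line
    boolean tryAllocateWithCasLoop(long size)
    {
        while (true)
        {
            long cur = allocated;
            if (cur + size > limit)
                return false; // over the limit, caller falls back to the slow path
            if (allocatedUpdater.compareAndSet(this, cur, cur + size))
                return true;
        }
    }

    // modified style: a single unconditional add; the counter may briefly exceed the limit,
    // so an overshoot has to be rolled back (or otherwise tolerated) by the caller
    boolean tryAllocateWithAddAndGet(long size)
    {
        long next = allocatedUpdater.addAndGet(this, size);
        if (next > limit)
        {
            allocatedUpdater.addAndGet(this, -size);
            return false;
        }
        return true;
    }
}
{code}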
** To reduce contention I switched MemtableAllocator.SubAllocator#owns from updates via an AtomicLongFieldUpdater to a LongAdder: the MemtableAllocator.SubAllocator#acquired(..) method, used in a hot path, updates the "owns" value but does not use the updated result.
{code:java}
Results:
Op rate                   : 90,220 op/s  [insert: 90,220 op/s]
Partition rate            : 90,220 pk/s  [insert: 90,220 pk/s]
Row rate                  : 902,196 row/s [insert: 902,196 row/s]
Latency mean              : 1.1 ms [insert: 1.1 ms]
Latency median            : 0.9 ms [insert: 0.9 ms]
Latency 95th percentile   : 1.7 ms [insert: 1.7 ms]
Latency 99th percentile   : 3.1 ms [insert: 3.1 ms]
Latency 99.9th percentile : 39.6 ms [insert: 39.6 ms]
Latency max               : 130.6 ms [insert: 130.6 ms]
Total partitions          : 10,000,000 [insert: 10,000,000]
Total errors              : 0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0 B
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:01:50
{code}
[^5.1_batch_LongAdder.html]
[https://github.com/apache/cassandra/commit/1deef005705e64abfa17c7d3117854be00dc7189]
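For illustration, a minimal sketch of the idea (the names are placeholders, not the real SubAllocator API): because the hot path only records the acquired size and never consumes the updated total inline, a striped LongAdder can replace the single AtomicLongFieldUpdater-managed field; reads via sum() are only approximate under concurrent updates, which is acceptable for accounting-style consumers.
{code:java}
import java.util.concurrent.atomic.LongAdder;

public class OwnershipCounterSketch
{
    private final LongAdder owns = new LongAdder();

    // hot path: only records the acquired size, the updated total is not needed here
    void acquired(long size)
    {
        owns.add(size);
    }

    // cold path (metrics, flush decisions): an approximate snapshot is sufficient
    long owned()
    {
        return owns.sum();
    }
}
{code}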
** An attempt to reduce possible false sharing by adding padding around the MemtablePool.SubPool.allocated/reclaiming/nextClean fields as well as the MemtableAllocator.SubAllocator.state/reclaiming fields, using a sub-class approach, like:
{code:java}
public static class SubPoolPadding0
{
    // why this padding style is used: https://shipilev.net/jvm/objects-inside-out/#_observation_hierarchy_tower_padding_trick_collapse_in_jdk_15
    byte p0_00, p0_01, p0_02, p0_03, p0_04, p0_05, p0_06, p0_07, p0_08, p0_09, p0_10, p0_11, p0_12, p0_13, p0_14, p0_15;
    byte p0_16, p0_17, p0_18, p0_19, p0_20, p0_21, p0_22, p0_23, p0_24, p0_25, p0_26, p0_27, p0_28, p0_29, p0_30, p0_31;
    byte p0_32, p0_33, p0_34, p0_35, p0_36, p0_37, p0_38, p0_39, p0_40, p0_41, p0_42, p0_43, p0_44, p0_45, p0_46, p0_47;
    byte p0_48, p0_49, p0_50, p0_51, p0_52, p0_53, p0_54, p0_55, p0_56, p0_57, p0_58, p0_59, p0_60, p0_61, p0_62, p0_63;
}

public static class SubPoolPadding1 extends SubPoolPadding0
{
    // total bytes allocated and reclaiming
    volatile long allocated;
}

public static class SubPoolPadding2 extends SubPoolPadding1
{
    byte p1_00, p1_01, p1_02, p1_03, p1_04, p1_05, p1_06, p1_07, p1_08, p1_09, p1_10, p1_11, p1_12, p1_13, p1_14, p1_15;
    byte p1_16, p1_17, p1_18, p1_19, p1_20, p1_21, p1_22, p1_23, p1_24, p1_25, p1_26, p1_27, p1_28, p1_29, p1_30, p1_31;
    byte p1_32, p1_33, p1_34, p1_35, p1_36, p1_37, p1_38, p1_39, p1_40, p1_41, p1_42, p1_43, p1_44, p1_45, p1_46, p1_47;
    byte p1_48, p1_49, p1_50, p1_51, p1_52, p1_53, p1_54, p1_55, p1_56, p1_57, p1_58, p1_59, p1_60, p1_61, p1_62, p1_63;
}
{code}
This did not help :( and the results are even slightly worse.
{code:java}
Results:
Op rate                   : 85,575 op/s  [insert: 85,575 op/s]
Partition rate            : 85,575 pk/s  [insert: 85,575 pk/s]
Row rate                  : 855,747 row/s [insert: 855,747 row/s]
Latency mean              : 1.2 ms [insert: 1.2 ms]
Latency median            : 0.9 ms [insert: 0.9 ms]
Latency 95th percentile   : 1.9 ms [insert: 1.9 ms]
Latency 99th percentile   : 3.5 ms [insert: 3.5 ms]
Latency 99.9th percentile : 39.0 ms [insert: 39.0 ms]
Latency max               : 152.2 ms [insert: 152.2 ms]
Total partitions          : 10,000,000 [insert: 10,000,000]
Total errors              : 0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0 B
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:01:56
{code}
[^5.1_batch_pad_allocated.html]
* Allocation batching
** A simple prototype for the first option mentioned by Benedict is added:
{quote}If we don't contend our update and no existing data is present, we can accurately calculate the space we require.{quote}
The current version only covers the offheap_objects case and only adding a new row to a memtable.
{code:java}
Results:
Op rate                   : 111,119 op/s  [insert: 111,119 op/s]
Partition rate            : 111,119 pk/s  [insert: 111,119 pk/s]
Row rate                  : 1,111,192 row/s [insert: 1,111,192 row/s]
Latency mean              : 0.9 ms [insert: 0.9 ms]
Latency median            : 0.7 ms [insert: 0.7 ms]
Latency 95th percentile   : 1.2 ms [insert: 1.2 ms]
Latency 99th percentile   : 2.1 ms [insert: 2.1 ms]
Latency 99.9th percentile : 38.8 ms [insert: 38.8 ms]
Latency max               : 163.1 ms [insert: 163.1 ms]
Total partitions          : 10,000,000 [insert: 10,000,000]
Total errors              : 0 [insert: 0]
Total GC count            : 0
Total GC memory           : 0 B
Total GC time             : 0.0 seconds
Avg GC time               : NaN ms
StdDev GC time            : 0.0 ms
Total operation time      : 00:01:29
{code}
[^5.1_batch_alloc_batching.html]
[https://github.com/apache/cassandra/commit/23efee95e07ef169e99827954bc2d7974af3f314]
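For illustration, a rough sketch of the batching idea (the interfaces and names below are hypothetical, not the prototype's actual API; the real change is in the linked commit): when a brand-new row is inserted and nothing has to be merged with existing data, its total off-heap size can be computed up front and acquired from the pool in a single accounting step instead of one step per cell.
{code:java}
import java.util.List;

public class BatchedRowAllocationSketch
{
    // size a cell will occupy in the native allocator (hypothetical interface)
    interface Cell
    {
        long offHeapSize();
    }

    // single pool accounting update, may apply backpressure (hypothetical interface)
    interface SubAllocator
    {
        void allocate(long size);
    }

    // One acquire for the whole new row instead of one acquire per cell;
    // applicable only when the row is new, so the required space is known exactly up front.
    static void allocateNewRow(long rowOverhead, List<Cell> cells, SubAllocator allocator)
    {
        long total = rowOverhead;
        for (Cell cell : cells)
            total += cell.offHeapSize();
        allocator.allocate(total);
        // the native buffers for the individual cells are then carved out of the reserved space
    }
}
{code}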
> Reduce contention in NativeAllocator.allocate
> ---------------------------------------------
>
>         Key: CASSANDRA-20226
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
>     Project: Apache Cassandra
>  Issue Type: Improvement
>  Components: Local/Memtable
>    Reporter: Dmitry Konstantinov
>    Assignee: Dmitry Konstantinov
>    Priority: Normal
>     Fix For: 5.x
>
> Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html, 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html, 5.1_batch_pad_allocated.html, cpu_profile_batch.html, image-2025-01-20-23-38-58-896.png, profile.yaml
>
>
> For a high insert batch rate it looks like we have a bottleneck in NativeAllocator.allocate, probably caused by contention within the logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following 2 potential places to assess:
> # allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a while loop with a CAS, which can be inefficient under high contention; similarly to CASSANDRA-15922 we can try to replace it with addAndGet (need to check that it does not break the allocator logic)
> # swap region logic in NativeAllocator.trySwapRegion (under a high insert rate, 1MiB regions can be swapped quite frequently)
> Reproducing test details:
> * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node somenode
> {code}
> * Cassandra version: 5.0.3
> * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>     configurations:
>         skiplist:
>             class_name: SkipListMemtable
>         trie:
>             class_name: TrieMemtable
>             parameters:
>                 shards: 32
>         default:
>             inherits: trie
> {code}
> * 1 node cluster
> * OpenJDK jdk-17.0.12+7
> * Linux kernel: 4.18.0-240.el8.x86_64
> * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> * RAM: 46GiB

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org