[
https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938001#comment-17938001
]
Dmitry Konstantinov edited comment on CASSANDRA-20226 at 3/24/25 8:34 PM:
--------------------------------------------------------------------------
Some initial experiment results (looks promising).
h3. Configuration
Before applying the changes I adjusted the number of flushing threads from the
default 2 to 4, because above ~500-600k rows/sec the flushing logic becomes a
bottleneck (memtable allocation backpressure kicks in). I expect to get some
improvements for flushing itself in CASSANDRA-20173 and CASSANDRA-20465.
So, the full set of changed parameters:
{code:java}
-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints # to improve async-profiler accuracy
-Dio.netty.eventLoopThreads=4 # we do not actually need as many as 2 * CPU cores
memtable_allocation_type: offheap_objects
memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
      parameters:
        shards: 32
    default:
      inherits: trie
commitlog_disk_access_mode: direct
native_transport_max_request_data_in_flight: 1024MiB
native_transport_max_request_data_in_flight_per_ip: 1024MiB
{code}
Compaction is enabled.
Test logic (1 text partition key column, 1 text clustering column, 5 text value
columns; inserts are done using 10-row batches):
{code:java}
./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node somenode
{code}
h3. Baseline
Baseline using the configuration above (recent 5.1/trunk build, commit: 10c8c042):
{code:java}
Results:
Op rate : 51,329 op/s [insert: 51,329 op/s]
Partition rate : 51,329 pk/s [insert: 51,329 pk/s]
Row rate : 513,289 row/s [insert: 513,289 row/s]
Latency mean : 1.9 ms [insert: 1.9 ms]
Latency median : 1.5 ms [insert: 1.5 ms]
Latency 95th percentile : 3.8 ms [insert: 3.8 ms]
Latency 99th percentile : 7.4 ms [insert: 7.4 ms]
Latency 99.9th percentile : 41.1 ms [insert: 41.1 ms]
Latency max : 114.8 ms [insert: 114.8 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:03:14
{code}
[^5.1_batch_baseline.html]
h3. Changes
It is a sketch as of now, not a final implementation. The changes are applied
one on top of another, in the order described (except the padding, which is not
applied).
* Existing allocation logic optimizations:
** To reduce contention I replaced the CAS loop in
MemtablePool.SubPool#tryAllocate with allocatedUpdater.addAndGet(this, size); a
simplified sketch of the change follows after the results.
{code:java}
Results:
Op rate : 56,040 op/s [insert: 56,356 op/s]
Partition rate : 56,040 pk/s [insert: 56,356 pk/s]
Row rate : 560,405 row/s [insert: 563,563 row/s]
Latency mean : 1.8 ms [insert: 1.8 ms]
Latency median : 1.4 ms [insert: 1.4 ms]
Latency 95th percentile : 3.4 ms [insert: 3.4 ms]
Latency 99th percentile : 7.5 ms [insert: 7.5 ms]
Latency 99.9th percentile : 41.3 ms [insert: 41.3 ms]
Latency max : 293.6 ms [insert: 293.6 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:02:58
{code}
[^5.1_batch_addAndGet.html]
[https://github.com/apache/cassandra/commit/842623a89042e6d55bc86cc225d348a0db3c5666]
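To illustrate the change, here is a minimal standalone sketch (the class and fields are simplified stand-ins, not the actual SubPool code): when the CAS loop exists only to publish allocated += size, a single fetch-and-add avoids retry storms under contention, and a limit check can be expressed by backing out on overshoot.
{code:java}
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Standalone sketch only; the real SubPool carries more state and checks.
final class SubPoolSketch
{
    private static final AtomicLongFieldUpdater<SubPoolSketch> allocatedUpdater =
        AtomicLongFieldUpdater.newUpdater(SubPoolSketch.class, "allocated");

    private volatile long allocated;
    private final long limit;

    SubPoolSketch(long limit)
    {
        this.limit = limit;
    }

    // before: classic CAS retry loop; every failed CAS under contention is a wasted round trip
    boolean tryAllocateCas(long size)
    {
        while (true)
        {
            long cur = allocated;
            if (cur + size > limit)
                return false;
            if (allocatedUpdater.compareAndSet(this, cur, cur + size))
                return true;
        }
    }

    // after: one unconditional fetch-and-add (a single lock xadd on x86), backed out on overshoot;
    // the transient overshoot visible to concurrent readers is the semantic difference to verify
    boolean tryAllocateAdd(long size)
    {
        if (allocatedUpdater.addAndGet(this, size) <= limit)
            return true;
        allocatedUpdater.addAndGet(this, -size);
        return false;
    }
}
{code}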
*
** To reduce contention I switched MemtableAllocator.SubAllocator#owns from
updates via AtomicLongFieldUpdater to LongAdder: the
MemtableAllocator.SubAllocator#acquired(..) method, used on a hot path, updates
the "owns" value but does not use the updated result. A sketch of the pattern
follows after the results.
{code:java}
Results:
Op rate : 90,220 op/s [insert: 90,220 op/s]
Partition rate : 90,220 pk/s [insert: 90,220 pk/s]
Row rate : 902,196 row/s [insert: 902,196 row/s]
Latency mean : 1.1 ms [insert: 1.1 ms]
Latency median : 0.9 ms [insert: 0.9 ms]
Latency 95th percentile : 1.7 ms [insert: 1.7 ms]
Latency 99th percentile : 3.1 ms [insert: 3.1 ms]
Latency 99.9th percentile : 39.6 ms [insert: 39.6 ms]
Latency max : 130.6 ms [insert: 130.6 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:01:50
{code}
[^5.1_batch_LongAdder.html]
[https://github.com/apache/cassandra/commit/1deef005705e64abfa17c7d3117854be00dc7189]
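The pattern in a nutshell (a simplified standalone sketch, not the actual SubAllocator code): because acquired(..) never consumes the updated total, the single contended counter word can be replaced by a LongAdder, which stripes writes across cells and defers aggregation to the (rare) reads.
{code:java}
import java.util.concurrent.atomic.LongAdder;

// Standalone sketch only; names are illustrative.
final class SubAllocatorSketch
{
    // before: a volatile long plus AtomicLongFieldUpdater, so every writer CASes the same word;
    // after: LongAdder stripes increments across cells, so concurrent writers rarely collide
    private final LongAdder owns = new LongAdder();

    // hot path: record ownership; the updated total is not needed here,
    // which is exactly the write-mostly pattern LongAdder is designed for
    void acquired(long size)
    {
        owns.add(size);
    }

    // reads sum the cells: more expensive and only weakly consistent under concurrent
    // updates, which is acceptable for accounting-style reads off the hot path
    long owns()
    {
        return owns.sum();
    }
}
{code}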
*
** An attempt to reduce possible false sharing by adding padding around the
MemtablePool.SubPool.allocated/reclaiming/nextClean fields as well as the
MemtableAllocator.SubAllocator.state/reclaiming fields, using a sub-class
approach like:
{code:java}
public static class SubPoolPadding0
{
    // why this padding style is used:
    // https://shipilev.net/jvm/objects-inside-out/#_observation_hierarchy_tower_padding_trick_collapse_in_jdk_15
    byte p0_00, p0_01, p0_02, p0_03, p0_04, p0_05, p0_06, p0_07, p0_08, p0_09, p0_10, p0_11, p0_12, p0_13, p0_14, p0_15;
    byte p0_16, p0_17, p0_18, p0_19, p0_20, p0_21, p0_22, p0_23, p0_24, p0_25, p0_26, p0_27, p0_28, p0_29, p0_30, p0_31;
    byte p0_32, p0_33, p0_34, p0_35, p0_36, p0_37, p0_38, p0_39, p0_40, p0_41, p0_42, p0_43, p0_44, p0_45, p0_46, p0_47;
    byte p0_48, p0_49, p0_50, p0_51, p0_52, p0_53, p0_54, p0_55, p0_56, p0_57, p0_58, p0_59, p0_60, p0_61, p0_62, p0_63;
}

public static class SubPoolPadding1 extends SubPoolPadding0
{
    // total bytes allocated and reclaiming
    volatile long allocated;
}

public static class SubPoolPadding2 extends SubPoolPadding1
{
    byte p1_00, p1_01, p1_02, p1_03, p1_04, p1_05, p1_06, p1_07, p1_08, p1_09, p1_10, p1_11, p1_12, p1_13, p1_14, p1_15;
    byte p1_16, p1_17, p1_18, p1_19, p1_20, p1_21, p1_22, p1_23, p1_24, p1_25, p1_26, p1_27, p1_28, p1_29, p1_30, p1_31;
    byte p1_32, p1_33, p1_34, p1_35, p1_36, p1_37, p1_38, p1_39, p1_40, p1_41, p1_42, p1_43, p1_44, p1_45, p1_46, p1_47;
    byte p1_48, p1_49, p1_50, p1_51, p1_52, p1_53, p1_54, p1_55, p1_56, p1_57, p1_58, p1_59, p1_60, p1_61, p1_62, p1_63;
}
{code}
It did not help :(, the results are even slightly worse.
{code:java}
Results:
Op rate : 85,575 op/s [insert: 85,575 op/s]
Partition rate : 85,575 pk/s [insert: 85,575 pk/s]
Row rate : 855,747 row/s [insert: 855,747 row/s]
Latency mean : 1.2 ms [insert: 1.2 ms]
Latency median : 0.9 ms [insert: 0.9 ms]
Latency 95th percentile : 1.9 ms [insert: 1.9 ms]
Latency 99th percentile : 3.5 ms [insert: 3.5 ms]
Latency 99.9th percentile : 39.0 ms [insert: 39.0 ms]
Latency max : 152.2 ms [insert: 152.2 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:01:56
{code}
[^5.1_batch_pad_allocated.html]
* Allocation batching
** A simple prototype for the first option mentioned by Benedict is added:
{quote}If we don't contend our update and no existing data is present, we can
accurately calculate the space we require.
{quote}
The current version covers only the offheap_objects case for now, and only
adding a new row into a memtable. An illustrative sketch of the idea follows
after the results.
{code:java}
Results:
Op rate : 111,119 op/s [insert: 111,119 op/s]
Partition rate : 111,119 pk/s [insert: 111,119 pk/s]
Row rate : 1,111,192 row/s [insert: 1,111,192 row/s]
Latency mean : 0.9 ms [insert: 0.9 ms]
Latency median : 0.7 ms [insert: 0.7 ms]
Latency 95th percentile : 1.2 ms [insert: 1.2 ms]
Latency 99th percentile : 2.1 ms [insert: 2.1 ms]
Latency 99.9th percentile : 38.8 ms [insert: 38.8 ms]
Latency max : 163.1 ms [insert: 163.1 ms]
Total partitions : 10,000,000 [insert: 10,000,000]
Total errors : 0 [insert: 0]
Total GC count : 0
Total GC memory : 0 B
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 00:01:29
{code}
[^5.1_batch_alloc_batching.html]
[https://github.com/apache/cassandra/commit/23efee95e07ef169e99827954bc2d7974af3f314]
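The core of the idea, as a standalone sketch (the names are illustrative and an AtomicLong stands in for the pool accounting; this is not the committed prototype): under the stated assumption of a new, uncontended row whose cell sizes can be computed exactly up front, the accounting can be charged once per row instead of once per cell.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Standalone sketch only; an AtomicLong stands in for the memtable pool accounting.
final class AllocationBatchingSketch
{
    private final AtomicLong allocated = new AtomicLong();

    // per-cell accounting: N contended atomic updates per row
    void allocatePerCell(long[] cellSizes)
    {
        for (long size : cellSizes)
            allocated.addAndGet(size);
    }

    // batched accounting: when the row is new and the sizes are exact,
    // a single atomic update covers the whole row
    void allocateBatched(long[] cellSizes)
    {
        long total = 0;
        for (long size : cellSizes)
            total += size;
        allocated.addAndGet(total);
    }
}
{code}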
> Reduce contention in NativeAllocator.allocate
> ---------------------------------------------
>
> Key: CASSANDRA-20226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Memtable
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html,
> 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html,
> 5.1_batch_pad_allocated.html, cpu_profile_batch.html,
> image-2025-01-20-23-38-58-896.png, profile.yaml
>
>
> For a high insert batch rate it looks like we have a bottleneck in
> NativeAllocator.allocate, probably caused by contention within the logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following 2 potential places to assess:
> # the allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a
> while loop with a CAS, which can be inefficient under high contention;
> similar to CASSANDRA-15922, we can try to replace it with addAndGet (need to
> check that it does not break the allocator logic)
> # the swap region logic in NativeAllocator.trySwapRegion (under a high insert
> rate, 1MiB regions can be swapped quite frequently)
> Reproducing test details:
> * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node somenode
> {code}
> * Cassandra version: 5.0.3
> * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>         shards: 32
>     default:
>       inherits: trie
> {code}
> * 1 node cluster
> * OpenJDK jdk-17.0.12+7
> * Linux kernel: 4.18.0-240.el8.x86_64
> * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> * RAM: 46GiB