[ 
https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927156#comment-17927156
 ] 

Dmitry Konstantinov edited comment on CASSANDRA-20226 at 2/14/25 3:52 PM:
--------------------------------------------------------------------------

trySwapRegion - yes, it also looks suspicious to me. I tried changing the 
region size to see whether it improves the situation, but it did not help 
much. I want to analyze it in more detail.

Replacing the CAS with addAndGet looks possible and gives some improvement in 
throughput (I will publish reports soon), but the allocation logic is still a 
bottleneck.
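
A minimal sketch of the two approaches, for illustration only. BoundedCounter 
and its methods are hypothetical names, not the actual 
MemtablePool.SubPool code; the sketch assumes a simple counter with a fixed 
upper limit.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the actual MemtablePool.SubPool implementation.
final class BoundedCounter
{
    private final long limit;
    private final AtomicLong allocated = new AtomicLong();

    BoundedCounter(long limit)
    {
        this.limit = limit;
    }

    // CAS loop: a failed compareAndSet forces a re-read and a retry,
    // so threads can spin repeatedly under high contention.
    boolean tryAllocateCas(long size)
    {
        while (true)
        {
            long cur = allocated.get();
            long next = cur + size;
            if (next > limit)
                return false;
            if (allocated.compareAndSet(cur, next))
                return true;
        }
    }

    // addAndGet: one unconditional atomic add, rolled back on overshoot.
    boolean tryAllocateAdd(long size)
    {
        long next = allocated.addAndGet(size);
        if (next <= limit)
            return true;
        allocated.addAndGet(-size); // undo the reservation that exceeded the limit
        return false;
    }

    long allocated()
    {
        return allocated.get();
    }
}
```

In the addAndGet variant the counter can briefly exceed the limit before the 
rollback, so a concurrent caller may see a failure that the CAS loop would not 
produce. That is the kind of behaviour change that has to be checked against 
the allocator logic.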

This is very similar to the JDK young-generation bump allocation story, which 
was solved there by introducing TLABs (thread-local allocation buffers). It is 
possible to do the same here, but the challenge is controlling the space 
wasted by the TLABs: compared to the JDK we have a multiplication here, number 
of threads x number of memtables, which makes the space waste issue more 
sensitive. To calculate the TLAB size, OpenJDK measures the allocation rate per 
thread and allocates TLABs proportionally to it, taking into account a 
configured TLAB waste percentage:
(https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp#L154)
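
A rough sketch of the TLAB idea applied to a shared region, to make the space 
waste concern concrete. SharedRegion and LocalBuffer are hypothetical names; 
this is neither the OpenJDK implementation nor existing Cassandra code, and it 
uses a fixed chunk size instead of sizing by allocation rate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shared region: offsets are handed out with a single atomic add.
final class SharedRegion
{
    private final long capacity;
    private final AtomicLong next = new AtomicLong();

    SharedRegion(long capacity)
    {
        this.capacity = capacity;
    }

    // Returns the start offset of the reserved range, or -1 if the region is full.
    long reserve(long size)
    {
        long end = next.addAndGet(size);
        if (end > capacity)
        {
            next.addAndGet(-size);
            return -1;
        }
        return end - size;
    }
}

// Hypothetical TLAB-like buffer: one instance per thread per memtable,
// so the fields are plain (non-atomic) and bumped without contention.
final class LocalBuffer
{
    private final SharedRegion region;
    private final long chunkSize;
    private long pos;
    private long end;
    private long wasted;

    LocalBuffer(SharedRegion region, long chunkSize)
    {
        this.region = region;
        this.chunkSize = chunkSize;
    }

    long allocate(long size)
    {
        if (size > chunkSize)
            return region.reserve(size); // too large for a chunk: use the shared path
        if (pos + size > end)
        {
            long start = region.reserve(chunkSize);
            if (start < 0)
                return -1;
            wasted += end - pos; // the tail of the old chunk is abandoned
            pos = start;
            end = start + chunkSize;
        }
        long result = pos;
        pos += size;
        return result;
    }

    long wasted()
    {
        return wasted;
    }
}
```

Besides the abandoned tails counted in wasted(), each live LocalBuffer can 
hold up to chunkSize of unused space, so the worst case grows as 
threads x memtables x chunkSize.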

An alternative could be predicting/estimating the space required for all 
allocations during a single memtable mutation (or a single row), but I need to 
check whether a good enough estimate can be made cheaply.
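
A minimal sketch of that alternative: estimate the total size up front, 
reserve it with one atomic operation, then carve the individual allocations 
out of the reservation without touching shared state. MutationReservation, 
PER_CELL_OVERHEAD and the estimate formula are assumptions made for 
illustration, and the sketch omits the capacity check on the shared cursor.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not existing Cassandra code.
final class MutationReservation
{
    // Assumed fixed per-cell overhead; the real value depends on the native cell layout.
    static final long PER_CELL_OVERHEAD = 16;

    private long pos;
    private final long end;

    private MutationReservation(long start, long size)
    {
        this.pos = start;
        this.end = start + size;
    }

    // Cheap estimate: sum of the value sizes plus the assumed per-cell overhead.
    static long estimate(int[] valueSizes)
    {
        long total = 0;
        for (int size : valueSizes)
            total += size + PER_CELL_OVERHEAD;
        return total;
    }

    // A single atomic add on the shared cursor reserves the whole estimated range.
    static MutationReservation reserve(AtomicLong sharedCursor, long estimatedSize)
    {
        long start = sharedCursor.getAndAdd(estimatedSize);
        return new MutationReservation(start, estimatedSize);
    }

    // Returns the offset of the carved range, or -1 if the estimate was too small
    // (the caller would then fall back to the shared allocator).
    long carve(long size)
    {
        if (pos + size > end)
            return -1;
        long result = pos;
        pos += size;
        return result;
    }

    // Space reserved but not used: the cost of over-estimating.
    long unused()
    {
        return end - pos;
    }
}
```

An over-estimate shows up as unused() space, an under-estimate as a fallback 
to the contended path, so the value of the approach depends on how accurate 
the cheap estimate can be.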


> Reduce contention in NativeAllocator.allocate
> ---------------------------------------------
>
>                 Key: CASSANDRA-20226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Memtable
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>         Attachments: cpu_profile_batch.html, 
> image-2025-01-20-23-38-58-896.png, profile.yaml
>
>
> For a high insert batch rate, it looks like we have a bottleneck in 
> NativeAllocator.allocate, probably caused by contention within the logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following 2 potential places to assess:
>  # allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a 
> while loop with a CAS, which can be inefficient under high contention; 
> similar to CASSANDRA-15922, we can try to replace it with addAndGet (need to 
> check that it does not break the allocator logic)
>  # swap region logic in NativeAllocator.trySwapRegion (under a high insert 
> rate 1MiB regions can be swapped quite frequently)
> Reproducing test details:
>  * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup 
> ops(insert=1) n=10m" -rate threads=100  -node somenode
> {code}
>  * Cassandra version: 5.0.3
>  * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>              shards: 32
>     default:
>       inherits: trie 
> {code}
>  * 1 node cluster
>  * OpenJDK jdk-17.0.12+7
>  * Linux kernel: 4.18.0-240.el8.x86_64
>  * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>  * RAM: 46GiB


