[ 
https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036790#comment-18036790
 ] 

Dmitry Konstantinov edited comment on CASSANDRA-20226 at 11/10/25 11:34 AM:
----------------------------------------------------------------------------

Hi [~blambov], thank you for the idea, it is really interesting. I think it is 
complementary to the current changes and can be applied on top of them: while 
I have tuned the accounting logic a bit, the main idea of the current changes 
is to introduce logic that predicts and pre-allocates memory for an incoming 
mutation. This not only reduces the accounting overhead by invoking the 
accounting logic less frequently, but also gives better locality for the cells 
allocated in slabs/offheap (so it is more CPU cache friendly) by avoiding a 
random mix of cells written by concurrent threads, so it is useful even if we 
remove the accounting overhead from the per-cell allocation logic.
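
The pre-allocation idea can be sketched roughly as follows (the identifiers are 
hypothetical and the chunk is a plain heap buffer here; in reality the memory 
would come from the byte-buffer/native slab allocators):
{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of per-mutation pre-allocation; these names are
// illustrative and are not the ones used in the actual patch.
final class MutationReservation
{
    // Stand-in for the shared memtable accounting counters (MemtablePool.SubPool).
    static final AtomicLong poolAllocated = new AtomicLong();

    private final ByteBuffer chunk; // contiguous region reserved for the whole mutation
    private int position;

    private MutationReservation(ByteBuffer chunk)
    {
        this.chunk = chunk;
    }

    // One accounting update per mutation instead of one per cell.
    static MutationReservation reserve(int estimatedMutationSize)
    {
        poolAllocated.addAndGet(estimatedMutationSize);
        return new MutationReservation(ByteBuffer.allocate(estimatedMutationSize));
    }

    // Per-cell allocation becomes a pointer bump inside the reservation, so the
    // cells of one mutation stay adjacent in memory (better CPU cache locality).
    ByteBuffer allocateCell(int size)
    {
        ByteBuffer cell = chunk.duplicate();
        cell.position(position).limit(position + size);
        position += size;
        return cell.slice();
    }
}
{code}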

{quote}
This means that the limit will be breached, but this also happens as it stands 
now because we will permit operations to run to completion if the memtable they 
have been marked for is scheduled for a flush 
{quote}
I suppose, except for the case when we have writes to several tables in 
parallel and we are flushing one (the largest) memtable but block allocation 
in the others.
But in general, I agree - it looks beneficial to split the allocation itself 
from the limit checking/blocking.

Currently, memory allocation itself (in byte buffer or native slabs) and 
accounting (to initiate a flush and to pause writes) are tightly connected, so 
it may take some time to restructure this logic to decouple them. My 
suggestion is to extract this improvement into a separate story (which I can 
take), to avoid over-complicating the current change and delaying it too much. 
If that is OK, I will create the story.
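
To make the suggested split more concrete, a very rough shape of the decoupled 
responsibilities could look like this (illustrative only, these interfaces do 
not exist and the real refactoring may end up quite different):
{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch of decoupling limit checking/blocking from allocation.
interface MemtableMemoryAccountant
{
    // May pause the caller or trigger a flush before the write starts,
    // based on the estimated mutation size.
    void reserve(long estimatedSize) throws InterruptedException;

    // Reconcile the estimate once the real size is known.
    void adjust(long actualMinusEstimated);
}

interface MemtableMemoryAllocator
{
    // Pure memory hand-out from byte-buffer or native slabs, no accounting inside.
    ByteBuffer allocate(int size);
}

final class WritePath
{
    private final MemtableMemoryAccountant accountant;
    private final MemtableMemoryAllocator allocator;

    WritePath(MemtableMemoryAccountant accountant, MemtableMemoryAllocator allocator)
    {
        this.accountant = accountant;
        this.allocator = allocator;
    }

    ByteBuffer write(int estimatedSize, int actualSize) throws InterruptedException
    {
        accountant.reserve(estimatedSize);                   // blocking/flush decision happens once, up front
        ByteBuffer region = allocator.allocate(actualSize);  // the allocation itself never blocks
        accountant.adjust(actualSize - estimatedSize);
        return region;
    }
}
{code}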


> Reduce contention in MemtableAllocator.allocate
> -----------------------------------------------
>
>                 Key: CASSANDRA-20226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Memtable
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html, 
> 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html, 
> 5.1_batch_pad_allocated.html, CASSANDRA-20226_ci_summary.htm, 
> CASSANDRA-20226_results_details.tar.xz, 
> ci_summary_netudima_CASSANDRA-20226-trunk_52.html, cpu_profile_batch.html, 
> image-2025-01-20-23-38-58-896.png, image-2025-11-10-00-04-57-497.png, 
> profile.yaml, results_details_netudima_CASSANDRA-20226-trunk_52.tar.xz, 
> test_results_m8i.4xlarge_heap_buffers.html, 
> test_results_m8i.4xlarge_heap_buffers.png, 
> test_results_m8i.4xlarge_offheap_objects.html, 
> test_results_m8i.4xlarge_offheap_objects.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For a high insert batch rate, it looks like we have a bottleneck in 
> NativeAllocator.allocate, probably caused by contention within the allocation 
> logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following 2 potential places to assess:
>  # allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a 
> while loop with a CAS, which can be inefficient under high contention; 
> similarly to CASSANDRA-15922, we can try to replace it with addAndGet (we 
> need to check that it does not break the allocator logic) - see the 
> illustrative sketch after this list
>  # swap region logic in NativeAllocator.trySwapRegion (under a high insert 
> rate 1MiB regions can be swapped quite frequently)
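> The difference between the two accounting strategies in item 1 can be sketched 
> as follows (illustrative only, this is not the actual MemtablePool.SubPool 
> code; whether the unconditional add is safe depends on the limit checks inside 
> the real loop):
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
> 
> class AllocationCounter
> {
>     private final AtomicLong allocated = new AtomicLong();
> 
>     // Current shape: retry loop with compareAndSet; every failed CAS forces
>     // another read-modify-write round trip, which degrades under contention.
>     long addWithCasLoop(long size)
>     {
>         while (true)
>         {
>             long cur = allocated.get();
>             long next = cur + size;
>             if (allocated.compareAndSet(cur, next))
>                 return next;
>         }
>     }
> 
>     // Candidate replacement: a single unconditional addAndGet (an atomic
>     // fetch-and-add), valid only if no per-iteration check is needed.
>     long addWithAddAndGet(long size)
>     {
>         return allocated.addAndGet(size);
>     }
> }
> {code}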
> Reproducing test details:
>  * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup 
> ops(insert=1) n=10m" -rate threads=100  -node somenode
> {code}
>  * Cassandra version: 5.0.3
>  * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>         shards: 32
>     default:
>       inherits: trie 
> {code}
>  * 1 node cluster
>  * OpenJDK jdk-17.0.12+7
>  * Linux kernel: 4.18.0-240.el8.x86_64
>  * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>  * RAM: 46GiB


