[ https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036752#comment-18036752 ]

Branimir Lambov commented on CASSANDRA-20226:
---------------------------------------------

There's another option to consider here. The allocation mechanism does not need 
to track individual cell writes. We could just as well track the usage of a 
mutation in a single {{allocate}} call after it completes, or track the 
allocations with a {{LongAdder}} without checking whether the limit is hit, and 
only check whether we need to wait for room before starting to apply a mutation.
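
A minimal sketch of the {{LongAdder}} variant (hypothetical names, not the 
existing {{MemtableAllocator}} API): usage is accumulated with no limit check 
on the hot path, and the total is only consulted at the decision points 
discussed below.

{code:java}
import java.util.concurrent.atomic.LongAdder;

// Hypothetical accounting class; illustrates the idea, not the actual code.
final class DeferredMemtableAccounting
{
    private final LongAdder allocated = new LongAdder();

    // Record the bytes written by a mutation; no limit check here, so
    // concurrent writers never contend on a shared CAS.
    void recordUsage(long bytes)
    {
        allocated.add(bytes);
    }

    // Eventually-consistent total, read only at decision points
    // (flush initiation, admission of new mutations).
    long allocatedBytes()
    {
        return allocated.sum();
    }
}
{code}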

We use the {{allocate}} code to decide:
- whether to initiate a flush, when the chosen memory limit is filled to some 
ratio
- whether to pause accepting writes, when the chosen memory limit has been 
exhausted

For the former use there is absolutely no benefit to making these decisions at 
the individual allocation level, as we will wait for the mutation to complete 
anyway before flushing anything. For the latter, I'd argue that the 
allocation-level tracking is actually hurting us. The reason is that the limit 
can be hit at any point during the application of a mutation, at which time the 
thread holds multiple locks (which necessitates the complexity of the 
{{isBlocking}} mutation signal), has already written a partial copy of the 
mutation to the memtable structures, and still holds a likely expanded version 
of the mutation to be applied on heap, keeping hold of more total memory than 
we would if we allowed the operation to continue.

If, instead, we check the allocation limits _before starting_ a mutation and, 
once started, allow it to fully progress to completion, we can avoid this 
situation at the cost of being somewhat late to notice that the limit has been 
reached. This means that the limit will be breached, but this also happens as 
things stand now, because we permit operations to run to completion if the 
memtable they have been marked for is scheduled for a flush -- which is 
effectively the same thing as not having noticed that the memory limit would be 
breached by this mutation at the time when we decided to start it.
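
A minimal sketch of the admission check (again with hypothetical names, 
building on the accounting sketch above): a mutation waits for room before it 
begins, but once admitted it always runs to completion, so the limit can be 
transiently breached exactly as described.

{code:java}
// Hypothetical gate consulted once per mutation, before it starts.
final class MutationAdmission
{
    private final DeferredMemtableAccounting accounting;
    private final long limitBytes;
    private final Object hasRoom = new Object();

    MutationAdmission(DeferredMemtableAccounting accounting, long limitBytes)
    {
        this.accounting = accounting;
        this.limitBytes = limitBytes;
    }

    // Block until usage drops below the limit; past this point the
    // mutation is never paused, so in-flight mutations may push the
    // pool somewhat over the limit.
    void waitForRoom() throws InterruptedException
    {
        synchronized (hasRoom)
        {
            while (accounting.allocatedBytes() >= limitBytes)
                hasRoom.wait();
        }
    }

    // Called when a flush completes and memtable memory is released.
    void onMemoryReleased()
    {
        synchronized (hasRoom)
        {
            hasRoom.notifyAll();
        }
    }
}
{code}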

> Reduce contention in MemtableAllocator.allocate
> -----------------------------------------------
>
>                 Key: CASSANDRA-20226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Memtable
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html, 
> 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html, 
> 5.1_batch_pad_allocated.html, CASSANDRA-20226_ci_summary.htm, 
> CASSANDRA-20226_results_details.tar.xz, 
> ci_summary_netudima_CASSANDRA-20226-trunk_52.html, cpu_profile_batch.html, 
> image-2025-01-20-23-38-58-896.png, image-2025-11-10-00-04-57-497.png, 
> profile.yaml, results_details_netudima_CASSANDRA-20226-trunk_52.tar.xz, 
> test_results_m8i.4xlarge_heap_buffers.html, 
> test_results_m8i.4xlarge_heap_buffers.png, 
> test_results_m8i.4xlarge_offheap_objects.html, 
> test_results_m8i.4xlarge_offheap_objects.png
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For a high insert batch rate, it looks like we have a bottleneck in 
> NativeAllocator.allocate, probably caused by contention within the logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following two potential places to assess:
>  # the allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has 
> a while loop with a CAS, which can be inefficient under high contention; 
> similarly to CASSANDRA-15922, we can try to replace it with addAndGet (we 
> need to check that this does not break the allocator logic; see the sketch 
> after this list)
>  # the region swap logic in NativeAllocator.trySwapRegion (under a high 
> insert rate, 1MiB regions can be swapped quite frequently)
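> A minimal sketch of point 1, with hypothetical field names rather than the 
> actual SubPool code: the CAS loop retries under contention, while addAndGet 
> completes in a single atomic update at the cost of rolling back an overshoot.
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> class AllocatorCounter
> {
>     private final AtomicLong allocated = new AtomicLong();
>
>     // Current style: CAS loop; every failed compareAndSet under
>     // contention is a wasted read-modify-write round trip.
>     boolean tryAllocateCas(long size, long limit)
>     {
>         while (true)
>         {
>             long cur = allocated.get();
>             if (cur + size > limit)
>                 return false;
>             if (allocated.compareAndSet(cur, cur + size))
>                 return true;
>         }
>     }
>
>     // addAndGet style: one atomic update that always succeeds;
>     // an overshoot past the limit is rolled back.
>     boolean tryAllocateAddAndGet(long size, long limit)
>     {
>         if (allocated.addAndGet(size) <= limit)
>             return true;
>         allocated.addAndGet(-size);
>         return false;
>     }
> }
> {code}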
> Reproducing test details:
>  * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup 
> ops(insert=1) n=10m" -rate threads=100  -node somenode
> {code}
>  * Cassandra version: 5.0.3
>  * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>         shards: 32
>     default:
>       inherits: trie 
> {code}
>  * 1 node cluster
>  * OpenJDK jdk-17.0.12+7
>  * Linux kernel: 4.18.0-240.el8.x86_64
>  * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>  * RAM: 46GiB


