[
https://issues.apache.org/jira/browse/CASSANDRA-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036752#comment-18036752
]
Branimir Lambov commented on CASSANDRA-20226:
---------------------------------------------
There's another option to consider here. The allocation mechanism does not need
to track individual cell writes. We could just as well track a mutation's
usage in a single {{allocate}} call after it completes, or accumulate the
allocations in a {{LongAdder}} without checking whether the limit is hit, and
only check whether we need to wait for room before starting to apply a
mutation.
We use the {{allocate}} code to decide:
- whether to initiate a flush, when the chosen memory limit is filled to some
ratio
- whether to pause accepting writes, when the chosen memory limit has been
exhausted
For the former use there is absolutely no benefit to making these decisions at
the individual allocation level, as we will wait for the mutation to complete
anyway before flushing anything. For the latter, I'd argue that the
allocation-level tracking is actually hurting us. The reason is that the limit
can be hit at any point during the application of a mutation, while we are
holding multiple locks (which necessitates the complexity of the
{{isBlocking}} mutation signal), have already written a partial copy of the
mutation to the memtable structures, and are keeping a likely expanded version
of the mutation to be applied on heap; in other words, we hold more total
memory than we would if we allowed the operation to continue.
If, instead, we check the allocation limits _before starting_ a mutation and,
once started, allow it to progress to completion, we can avoid this situation
at the cost of noticing somewhat late that the limit has been reached. This
means the limit will be breached, but that already happens as it stands now:
we permit operations to run to completion if the memtable they have been
marked for is scheduled for a flush, which is effectively the same as not
having noticed, at the time we decided to start the mutation, that it would
breach the memory limit.
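A minimal sketch of the proposed shape, assuming a {{LongAdder}} for the
accounting; the names here ({{MutationGate}}, {{awaitRoom}},
{{onMutationApplied}}, {{onFlushed}}) are illustrative, not Cassandra's actual
API:
{code:java}
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: names are hypothetical, not Cassandra's
// actual memtable pool API.
public class MutationGate
{
    private final long limitBytes;                       // configured memory limit
    private final LongAdder allocated = new LongAdder(); // no CAS loop, no per-cell limit check

    public MutationGate(long limitBytes)
    {
        this.limitBytes = limitBytes;
    }

    // Called before a mutation takes any memtable locks; blocks while the
    // pool is exhausted. Mutations already admitted are never paused, so
    // the limit can be overshot by the in-flight set (the trade-off above).
    public synchronized void awaitRoom() throws InterruptedException
    {
        while (allocated.sum() >= limitBytes)
            wait();
    }

    // Called once when a mutation completes: a single accounting step
    // replaces per-cell allocate() calls, and no limit check is needed here.
    public void onMutationApplied(long bytes)
    {
        allocated.add(bytes);
    }

    // Called when a flush returns memory to the pool; wakes blocked writers.
    public synchronized void onFlushed(long bytes)
    {
        allocated.add(-bytes);
        notifyAll();
    }
}
{code}
The only point that can block is {{awaitRoom}}, taken before any memtable
locks are acquired, so a stalled writer holds no partial mutation state.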
> Reduce contention in MemtableAllocator.allocate
> -----------------------------------------------
>
> Key: CASSANDRA-20226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20226
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Memtable
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 5.1_batch_LongAdder.html, 5.1_batch_addAndGet.html,
> 5.1_batch_alloc_batching.html, 5.1_batch_baseline.html,
> 5.1_batch_pad_allocated.html, CASSANDRA-20226_ci_summary.htm,
> CASSANDRA-20226_results_details.tar.xz,
> ci_summary_netudima_CASSANDRA-20226-trunk_52.html, cpu_profile_batch.html,
> image-2025-01-20-23-38-58-896.png, image-2025-11-10-00-04-57-497.png,
> profile.yaml, results_details_netudima_CASSANDRA-20226-trunk_52.tar.xz,
> test_results_m8i.4xlarge_heap_buffers.html,
> test_results_m8i.4xlarge_heap_buffers.png,
> test_results_m8i.4xlarge_offheap_objects.html,
> test_results_m8i.4xlarge_offheap_objects.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> At a high insert batch rate we appear to have a bottleneck in
> NativeAllocator.allocate, most likely caused by contention within its logic.
> !image-2025-01-20-23-38-58-896.png|width=300!
> [^cpu_profile_batch.html]
> The logic has at least the following two potential places to assess:
> # the allocation cycle in MemtablePool.SubPool#tryAllocate. This logic has a
> while loop with a CAS, which can be inefficient under high contention;
> similar to CASSANDRA-15922, we can try to replace it with addAndGet (we need
> to check that this does not break the allocator logic; see the sketch after
> this list)
> # the swap-region logic in NativeAllocator.trySwapRegion (under a high insert
> rate, 1MiB regions can be swapped quite frequently)
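> As an illustrative contrast (a hypothetical {{PoolCounter}} class, not the
> actual {{SubPool}} code), the two approaches could look like this; the
> add-then-roll-back pattern is the standard way to keep a limit check with
> {{addAndGet}}:
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> public class PoolCounter
> {
>     private final AtomicLong allocated = new AtomicLong();
>     private final long limit;
>
>     public PoolCounter(long limit)
>     {
>         this.limit = limit;
>     }
>
>     // Current shape: a CAS retry loop; every failed compareAndSet is
>     // wasted work when many writer threads contend on the counter.
>     public boolean tryAllocateCas(long size)
>     {
>         while (true)
>         {
>             long cur = allocated.get();
>             if (cur + size > limit)
>                 return false;
>             if (allocated.compareAndSet(cur, cur + size))
>                 return true;
>         }
>     }
>
>     // Candidate shape: addAndGet is a single atomic add (e.g. lock xadd
>     // on x86), so threads never retry; overshoot is rolled back, and is
>     // briefly visible to concurrent readers of the counter.
>     public boolean tryAllocateAdd(long size)
>     {
>         if (allocated.addAndGet(size) <= limit)
>             return true;
>         allocated.addAndGet(-size);
>         return false;
>     }
> }
> {code}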
> Reproducing test details:
> * test logic
> {code:java}
> ./tools/bin/cassandra-stress "user profile=./profile.yaml no-warmup ops(insert=1) n=10m" -rate threads=100 -node somenode
> {code}
> * Cassandra version: 5.0.3
> * configuration changes compared to default:
> {code:java}
> memtable_allocation_type: offheap_objects
> memtable:
>   configurations:
>     skiplist:
>       class_name: SkipListMemtable
>     trie:
>       class_name: TrieMemtable
>       parameters:
>         shards: 32
>     default:
>       inherits: trie
> {code}
> * 1 node cluster
> * OpenJDK jdk-17.0.12+7
> * Linux kernel: 4.18.0-240.el8.x86_64
> * CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> * RAM: 46GiB