[ 
https://issues.apache.org/jira/browse/CASSANDRA-19429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946200#comment-17946200
 ] 

Dmitry Konstantinov commented on CASSANDRA-19429:
-------------------------------------------------

Some updates from my side: the issue is reproducing in some way using a real 
application workload and a cluster of Cassandra nodes with 20 cores and 94 Gb 
RAM (31G Cassandra heap size). Cassandra version: 4.1.4. We see periodic spikes 
in thread pool queue sizes (read stage) and higher local read latency.

My theory:
 * we have a lock taken to execute getCapacity (within Caffeine 
[getMaximum()|https://github.com/ben-manes/caffeine/blob/v2.9.2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L3545])
 within read threads by every read operation x number of SSTables which have a 
partition key to read.
 * in normal cases the operation is fast, we take the lock for a small period 
of time and the probability for several threads to try to acquire this lock at 
the same time is not very high, so the contention and parking ratio is low. It 
can be not that good for cases when we have lot of CPU cores and the 
parallelism level is high.
 * But it is not only the factor, another important aspect which I noticed the 
Caffeine lock is taken not only by this fast getCapacity method during the 
processing. It is also taken by [eviction 
logic|https://github.com/ben-manes/caffeine/blob/v2.9.2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1428]
 .
 * So, I suppose we may have the following cache behaviour:
 ** (trigger 1) key cache is fully filled, distribution for partition keys is 
wide and we hits frequently non-cached keys
 ** (trigger 2) compaction makes SSTables and correspondent records in the key 
cache obsolete. So, I would say writing and compaction in executed in the 
previous synthetic tests mentioned in this ticket are not noise creators only 
but potentially important contributors to the issue reproducing, so we should 
not disable them in attempts to reproduce the issue.
 ** as a consequence cache has to evict keys frequently
 ** the eviction logic is heavier than getCapacity(), so the probability that a 
read thread hits the lock occupied by the eviction logic is higher.
 ** read threads are parking to wait for the lock release and we may collect 
several of them in a queue here
 ** when eviction logic has completed now we need to unpark our read threads to 
get the lock and it is happening one by one, unparking is not very cheap and 
fast, so it takes time to process with accumulated "queue" of read threads

Wall-clock (not CPU!) async profiler profile: 
[^wallclock_profile_with_lock_issue.html]
!image-2025-04-21-19-00-55-200.png|width=600!

Cache eviction logic:
!image-2025-04-21-19-02-41-725.png|width=600!

To check how key cache eviction is related to the issue we've increased Key 
cache size from default 100Mb to 1024MB and it helped for while, then read 
latency has started to degrade again as soon as cache size limit has been 
reached.

Key cache disabling by setting capacity to 0 also works as an WA.

We have captured a set of thread dumps and periodically we can observe the 
state when read threads are blocked by an eviction logic caused by SSTables 
invalidation, like:
{code:java}
"NonPeriodicTasks:1" #327 daemon prio=5 os_prio=0 cpu=124133.66ms 
elapsed=85609.60s tid=0x00007fbd749cc200 nid=0xd28 runnable  
[0x00007fbd7beb7000]
   java.lang.Thread.State: RUNNABLE
    at 
com.github.benmanes.caffeine.cache.AccessOrderDeque.remove(AccessOrderDeque.java:53)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache$RemovalTask.run(BoundedLocalCache.java:1726)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.drainWriteBuffer(BoundedLocalCache.java:1610)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1486)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1463)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3361)
    at 
org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1435)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1402)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1372)
    at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.remove(BoundedLocalCache.java:2189)
    at 
com.github.benmanes.caffeine.cache.LocalManualCache.invalidate(LocalManualCache.java:140)
    at org.apache.cassandra.cache.CaffeineCache.remove(CaffeineCache.java:124)
    at org.apache.cassandra.cache.CaffeineCache.remove(CaffeineCache.java:35)
    at 
org.apache.cassandra.cache.InstrumentingCache.remove(InstrumentingCache.java:74)
    at 
org.apache.cassandra.io.sstable.SSTableRewriter$InvalidateKeys.run(SSTableRewriter.java:286)
    at 
org.apache.cassandra.io.sstable.format.SSTableReader$DropPageCache.run(SSTableReader.java:959)
    at 
org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1.run(SSTableReader.java:2085)
    at 
org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
    at 
org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
{code}
as well eviction during writes to a full cache:
{code:java}
"SharedPool-Worker-20" #425 daemon prio=10 os_prio=0 cpu=1196423.03ms 
elapsed=85384.56s tid=0x00007fbd86a3f200 nid=0x1212 runnable  
[0x00007fbcd30eb000]
   java.lang.Thread.State: RUNNABLE
        at 
com.github.benmanes.caffeine.cache.AccessOrderDeque.remove(AccessOrderDeque.java:53)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntry(BoundedLocalCache.java:973)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictFromMain(BoundedLocalCache.java:755)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntries(BoundedLocalCache.java:612)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1495)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1463)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3361)
        at 
org.apache.cassandra.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:140)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1435)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1402)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1372)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2063)
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2000)
        at 
com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:130)
        at org.apache.cassandra.cache.CaffeineCache.put(CaffeineCache.java:109)
        at org.apache.cassandra.cache.CaffeineCache.put(CaffeineCache.java:35)
        at 
org.apache.cassandra.cache.InstrumentingCache.put(InstrumentingCache.java:43)
        at 
org.apache.cassandra.io.sstable.format.SSTableReader.cacheKey(SSTableReader.java:1341)
        at 
org.apache.cassandra.io.sstable.format.big.BigTableReader.getPosition(BigTableReader.java:273)
        at 
org.apache.cassandra.io.sstable.format.SSTableReader.getPosition(SSTableReader.java:1399)
        at 
org.apache.cassandra.io.sstable.format.big.BigTableReader.rowIterator(BigTableReader.java:66)
        at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.initializeIterator(UnfilteredRowIteratorWithLowerBound.java:100)
        at 
org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.maybeInit(LazilyInitializedUnfilteredRowIterator.java:48)
        at 
org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.partitionLevelDeletion(LazilyInitializedUnfilteredRowIterator.java:81)
        at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.partitionLevelDeletion(UnfilteredRowIteratorWithLowerBound.java:161)
        at 
org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:726)
        at 
org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:625)
        at 
org.apache.cassandra.db.SinglePartitionReadCommand.queryStorage(SinglePartitionReadCommand.java:459)
        at 
org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:421)
        at 
org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:62)
        at 
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
        at 
org.apache.cassandra.net.InboundSink$$Lambda$619/0x00000017c0d3f440.accept(Unknown
 Source)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
        at 
org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
        at 
org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.24/Thread.java:829)
{code}
 

I will try to make a synthetic reproducer for this issue.

> Remove lock contention generated by getCapacity function in SSTableReader
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19429
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19429
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Local/SSTable
>            Reporter: Dipietro Salvatore
>            Assignee: Dipietro Salvatore
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x
>
>         Attachments: Screenshot 2024-02-26 at 10.27.10.png, Screenshot 
> 2024-02-27 at 11.29.41.png, Screenshot 2024-03-19 at 15.22.50.png, 
> asprof_cass4.1.3__lock_20240216052912lock.html, 
> image-2024-03-08-15-51-30-439.png, image-2024-03-08-15-52-07-902.png, 
> image-2025-04-21-19-00-55-200.png, image-2025-04-21-19-02-41-725.png, 
> wallclock_profile_with_lock_issue.html
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Profiling Cassandra 4.1.3 on large AWS instances, a high number of lock 
> acquires is measured in the `getCapacity` function from 
> `org/apache/cassandra/cache/InstrumentingCache` (1.9M lock acquires per 60 
> seconds). Based on our tests on r8g.24xlarge instances (using Ubuntu 22.04), 
> this limits the CPU utilization of the system to under 50% when testing at 
> full load and therefore limits the achieved throughput.
> Removing the lock contention from the SSTableReader.java file by replacing 
> the call to `getCapacity` with `size` achieves up to 2.95x increase in 
> throughput on r8g.24xlarge and 2x on r7i.24xlarge:
> |Instance type|Cass 4.1.3|Cass 4.1.3 patched|
> |r8g.24xlarge|168k ops|496k ops (2.95x)|
> |r7i.24xlarge|153k ops|304k ops (1.98x)|
>  
> Instructions to reproduce:
> {code:java}
> ## Requirements for Ubuntu 22.04
> sudo apt install -y ant git openjdk-11-jdk
> ## Build and run
> CASSANDRA_USE_JDK11=true ant realclean && CASSANDRA_USE_JDK11=true ant jar && 
> CASSANDRA_USE_JDK11=true ant stress-build  && rm -rf data && bin/cassandra -f 
> -R
> # Run
> bin/cqlsh -e 'drop table if exists keyspace1.standard1;' && \
> bin/cqlsh -e 'drop keyspace if exists keyspace1;' && \
> bin/nodetool clearsnapshot --all && tools/bin/cassandra-stress write 
> n=10000000 cl=ONE -rate threads=384 -node 127.0.0.1 -log file=cload.log 
> -graph file=cload.html && \
> bin/nodetool compact keyspace1   && sleep 30s && \
> tools/bin/cassandra-stress mixed ratio\(write=10,read=90\) duration=10m 
> cl=ONE -rate threads=406 -node localhost -log file=result.log -graph 
> file=graph.html
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to