[ 
https://issues.apache.org/jira/browse/CASSANDRA-19703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928952#comment-17928952
 ] 

Andy Tolbert edited comment on CASSANDRA-19703 at 2/21/25 12:16 AM:
--------------------------------------------------------------------

I've recently also encountered this issue, and [~yukei]'s explanation of it 
seems right on!

Short of upgrading Caffeine, I think one way to address this would be to ensure 
that the timestamp of the INSERT always precedes the timestamp of the DELETE. 
We can accomplish this pretty trivially by adding {{USING TIMESTAMP}} in 
{{SystemKeyspace.writePreparedStatement}} with a timestamp that was generated 
on construction of the {{Prepared}}. This would ensure the INSERT doesn't 
shadow the DELETE, even if the DELETE is applied before the INSERT completes.

Additionally, any clusters currently experiencing a leaky 
system.prepared_statements table from this bug may struggle to bounce into a 
version with this fix, as 
{{SystemKeyspace.loadPreparedStatements}} currently does not paginate the 
query against system.prepared_statements, causing heap OOMs. To fix this, the patch 
adds pagination at 5000 rows, which should allow nodes to come up and delete 
older prepared statements that may no longer be used as 
the cache fills up (which should happen immediately).
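As a rough sketch of the pagination shape (illustrative only; the real patch pages the internal CQL query rather than an in-memory list, and the class and method names here are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: process rows in fixed-size pages so the whole result
// set is never materialized at once, bounding heap usage during startup.
public class PagedLoad {
    public static final int PAGE_SIZE = 5000;

    // Splits the rows into pages of at most PAGE_SIZE elements each.
    public static <T> List<List<T>> pages(List<T> rows) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += PAGE_SIZE) {
            out.add(new ArrayList<>(rows.subList(i, Math.min(i + PAGE_SIZE, rows.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 12_000; i++) rows.add(i);
        List<List<Integer>> paged = pages(rows);
        System.out.println(paged.size());        // 3
        System.out.println(paged.get(2).size()); // 2000
    }
}
```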

I've created a proposed solution in [GitHub Pull Request 
#3917|https://github.com/apache/cassandra/pull/3917].

Note that this patch does not address the issue of Caffeine immediately 
evicting a prepared statement; however, it will prevent the 
system.prepared_statements table from growing unbounded. For most users I think this 
should be adequate, as the cache should only fill up when there is an 
erroneously large number of unique prepared statements. In that case we can expect that 
clients will constantly (re)prepare statements regardless of whether or not the 
cache is evicting statements.
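The "constantly (re)prepare" point can be seen with any bounded cache: once the number of unique statements exceeds capacity, every further prepare evicts an entry. A generic LinkedHashMap-based LRU sketch (not Caffeine's actual policy; sizes and names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic bounded LRU (not Caffeine's policy): once unique keys exceed
// capacity, every further insertion evicts an entry, so a client issuing
// endless unique prepares keeps churning the cache either way.
public class PreparedCache extends LinkedHashMap<String, String> {
    private final int capacity;
    private int evictions = 0;

    public PreparedCache(int capacity) {
        super(16, 0.75f, true); // access-order iteration gives LRU behavior
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        boolean evict = size() > capacity;
        if (evict) evictions++;
        return evict;
    }

    public int evictions() { return evictions; }

    public static void main(String[] args) {
        PreparedCache cache = new PreparedCache(100);
        for (int i = 0; i < 1_000; i++) {
            cache.put("stmt-" + i, "plan"); // 1000 unique "prepared statements"
        }
        System.out.println(cache.size());      // 100
        System.out.println(cache.evictions()); // 900
    }
}
```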



> Newly inserted prepared statements got evicted too early from cache that 
> leads to race condition
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19703
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19703
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Yuqi Yan
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.1.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We're upgrading from Cassandra 4.0 to Cassandra 4.1.3, and the 
> system.prepared_statements table started growing to GBs in size after the 
> upgrade. This slows down node startup significantly while it's doing 
> preloadPreparedStatements.
> I can't share the exact logs, but it's a race condition like this:
>  # [Thread 1] Receives a prepare request for S1; attempts to get S1 from the cache
>  # [Thread 1] Cache miss; puts S1 into the cache
>  # [Thread 1] Attempts to write S1 into the local table
>  # [Thread 2] Receives a prepare request for S2; attempts to get S2 from the cache
>  # [Thread 2] Cache miss; puts S2 into the cache
>  # [Thread 2] Cache is full; evicts S1 from the cache
>  # [Thread 2] Attempts to delete S1 from the local table
>  # [Thread 2] Tombstone inserted for S1; delete finished
>  # [Thread 1] Record inserted for S1; write finished
> Thread 2 inserted a tombstone for S1 earlier than Thread 1 was able to insert 
> the record into the table. Hence the data will not be removed, because the later 
> insert has a newer write time than the tombstone.
> Whether this happens or not depends on how the cache decides which entry to 
> evict next when it's full. We noticed that in 4.1.3 Caffeine was 
> upgraded to 2.9.2 (CASSANDRA-15153).
>  
> I did some research into Caffeine's commits. It seems this commit caused 
> entries to be evicted too early: "Eagerly evict an entry if it is too large to fit 
> in the cache" (Feb 2021), available after 2.9.0: 
> [https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> And later fixed in: "Improve eviction when overflow or the weight is 
> oversized" (Aug 2022), available after 3.1.2: 
> [https://github.com/ben-manes/caffeine/commit/25b7d17b1a246a63e4991d4902a2ecf24e86d234]
> {quote}Previously an attempt to centralize evictions into one code path led 
> to a suboptimal approach 
> ([{{464bc19}}|https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> ). This tried to move those entries into the LRU position for early eviction, 
> but was confusing and could too aggressively evict something that is 
> desirable to keep.
> {quote}
>  
> I upgraded Caffeine to 3.1.8 (same as 5.0 trunk) and this issue is gone. 
> But I think that version is not compatible with Java 8.
> I'm not 100% sure if this is the root cause or what the correct fix is here. 
> Would appreciate it if anyone can take a look, thanks
>  
>  


