[ https://issues.apache.org/jira/browse/CASSANDRA-19703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928952#comment-17928952 ]
Andy Tolbert edited comment on CASSANDRA-19703 at 2/21/25 12:16 AM:
--------------------------------------------------------------------

I've recently also encountered this issue, and [~yukei]'s explanation of it seems right on!

Short of upgrading Caffeine, I think one way to address this would be to ensure that the timestamp of the INSERT always precedes the timestamp of the DELETE. We can accomplish this pretty trivially by adding {{USING TIMESTAMP}} to the statement in {{SystemKeyspace.writePreparedStatement}}, using a timestamp generated on construction of the {{Prepared}}. This ensures the INSERT doesn't shadow the DELETE, even if the DELETE is written before the INSERT completes.

Additionally, any clusters currently experiencing a leaky system.prepared_statements table from this bug may struggle to bounce into a version with this fix, as {{SystemKeyspace.loadPreparedStatements}} currently does not paginate its query of system.prepared_statements, which can cause heap OOMs. To fix this, the patch adds pagination at 5000 rows, which should allow nodes to come up and then delete older prepared statements that are no longer used as the cache fills up (which should happen immediately).

I've created a proposed solution in [GitHub Pull Request #3917|https://github.com/apache/cassandra/pull/3917]. Note that this patch does not address the issue of Caffeine immediately evicting a prepared statement; however, it will prevent the system.prepared_statements table from growing unbounded. For most users I think this should be adequate, as the cache should only fill up when there are erroneously many unique prepared statements. In such a case we can expect that clients will constantly (re)prepare statements regardless of whether or not the cache is evicting statements.
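To illustrate why pinning the INSERT's timestamp works, here is a toy last-write-wins sketch (class and method names are hypothetical, not the actual SystemKeyspace code): Cassandra reconciles a live cell and a tombstone by keeping whichever carries the higher write timestamp, so an INSERT stamped at Prepared-construction time can never shadow a DELETE issued later on eviction.

```java
// Toy model of last-write-wins reconciliation. Hypothetical sketch only;
// not the actual SystemKeyspace.writePreparedStatement implementation.
public class TimestampFixSketch {
    // Cassandra keeps whichever mutation carries the higher write timestamp,
    // so the row survives only if the INSERT's timestamp beats the DELETE's.
    static boolean rowSurvives(long insertTs, long deleteTs) {
        return insertTs > deleteTs;
    }

    public static void main(String[] args) {
        long prepareTs = 100;        // timestamp captured when Prepared is built
        long evictionDeleteTs = 150; // eviction's DELETE is written first...
        long lateInsertTs = 200;     // ...but the INSERT mutation lands later

        // Without the fix: the INSERT uses its (later) execution timestamp
        // and shadows the tombstone, so the row leaks forever.
        assert rowSurvives(lateInsertTs, evictionDeleteTs);

        // With the fix: the INSERT carries USING TIMESTAMP prepareTs, which
        // always precedes any DELETE issued on eviction, so the tombstone wins.
        assert !rowSurvives(prepareTs, evictionDeleteTs);
        System.out.println("ok");
    }
}
```

The corresponding CQL would look roughly like {{INSERT INTO system.prepared_statements (...) VALUES (...) USING TIMESTAMP ?}}, binding the timestamp captured when the {{Prepared}} was constructed.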
> Newly inserted prepared statements got evicted too early from cache that
> leads to race condition
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19703
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19703
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Yuqi Yan
>            Assignee: Cameron Zemek
>            Priority: Normal
>             Fix For: 4.1.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We're upgrading from Cassandra 4.0 to Cassandra 4.1.3 and system.prepared_statements table size start growing to GB size after upgrade. This slows down node startup significantly when it's doing preloadPreparedStatements.
> I can't share the exact log but it's a race condition like this:
> # [Thread 1] Receives a prepare request for S1. Attempts to get S1 in cache
> # [Thread 1] Cache miss, put this S1 into cache
> # [Thread 1] Attempts to write S1 into local table
> # [Thread 2] Receives a prepare request for S2. Attempts to get S2 in cache
> # [Thread 2] Cache miss, put this S2 into cache
> # [Thread 2] Cache is full, evicting S1 from cache
> # [Thread 2] Attempts to delete S1 from local table
> # [Thread 2] Tombstone inserted for S1, delete finished
> # [Thread 1] Record inserted for S1, write finished
> Thread 2 inserted a tombstone for S1 earlier than Thread 1 was able to insert the record in the table. Hence the data will not be removed, because the later insert has a newer write time than the tombstone.
> Whether this would happen or not depends on how the cache decides what's the next entry to evict when it's full. We noticed that in 4.1.3 Caffeine was upgraded to 2.9.2 (CASSANDRA-15153).
> I did a small research in Caffeine commits.
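The interleaving in the steps above can be replayed against a toy last-write-wins store (a hypothetical sketch, not Cassandra code) to show why the tombstone from step 8 ends up shadowed by the insert from step 9:

```java
import java.util.HashMap;
import java.util.Map;

// Replays the reported race against a toy last-write-wins "table" where a
// live row and a tombstone are reconciled purely by write timestamp.
// Hypothetical sketch only; not the actual Cassandra storage engine.
public class EvictionRaceSketch {
    // value encoding: positive timestamp = live row, negative = tombstone at |ts|
    static final Map<String, Long> table = new HashMap<>();

    static void insert(String id, long ts) { apply(id, ts); }
    static void delete(String id, long ts) { apply(id, -ts); }

    // Last-write-wins: keep whichever mutation has the higher |timestamp|.
    static void apply(String id, long signedTs) {
        table.merge(id, signedTs, (a, b) -> Math.abs(a) >= Math.abs(b) ? a : b);
    }

    static boolean isLive(String id) {
        Long v = table.get(id);
        return v != null && v > 0;
    }

    public static void main(String[] args) {
        // Steps 7-9 of the report, with timestamps in execution order:
        delete("S1", 2); // Thread 2 evicts S1 and writes the tombstone first
        insert("S1", 3); // Thread 1's INSERT finishes later with a newer timestamp

        // The later INSERT shadows the tombstone, so S1 is never removed
        // and system.prepared_statements grows unbounded.
        assert isLive("S1");
        System.out.println("S1 leaked: " + isLive("S1"));
    }
}
```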
> It seems this commit was causing the entry to get evicted too early: "Eagerly evict an entry if it is too large to fit in the cache" (Feb 2021), available after 2.9.0:
> [https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]
> And later fixed in: "Improve eviction when overflow or the weight is oversized" (Aug 2022), available after 3.1.2:
> [https://github.com/ben-manes/caffeine/commit/25b7d17b1a246a63e4991d4902a2ecf24e86d234]
> {quote}Previously an attempt to centralize evictions into one code path led to a suboptimal approach ([{{464bc19}}|https://github.com/ben-manes/caffeine/commit/464bc1914368c47a0203517fda2151fbedaf568b]). This tried to move those entries into the LRU position for early eviction, but was confusing and could too aggressively evict something that is desirable to keep.
> {quote}
> I upgraded Caffeine to 3.1.8 (same as 5.0 trunk) and this issue is gone. But I think this version is not compatible with Java 8.
> I'm not 100% sure if this is the root cause and what's the correct fix here. Would appreciate if anyone can have a look, thanks

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org