[ https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949645#comment-17949645 ]
Stefan Miklosovic edited comment on CASSANDRA-19776 at 5/6/25 7:31 AM:
-----------------------------------------------------------------------

[~marcuse] Interesting! I have done that here: https://github.com/apache/cassandra/pull/4136

Looks good, but I wonder why we went with "actuals" only before. Is there any deeper logic behind that?

[CASSANDRA-19776-4.0-second-approach|https://github.com/instaclustr/cassandra/tree/CASSANDRA-19776-4.0-second-approach]

{noformat}
java11_pre-commit_tests
  ✓ j11_build                          1m 36s
  ✓ j11_cqlsh-dtests-py2-no-vnodes     5m 22s
  ✓ j11_cqlsh-dtests-py2-with-vnodes   6m 6s
  ✓ j11_cqlsh_dtests_py3               5m 32s
  ✓ j11_cqlsh_dtests_py311             5m 42s
  ✓ j11_cqlsh_dtests_py311_vnode       5m 45s
  ✓ j11_cqlsh_dtests_py38              5m 41s
  ✓ j11_cqlsh_dtests_py38_vnode        5m 51s
  ✓ j11_cqlsh_dtests_py3_vnode         6m 34s
  ✓ j11_dtests_vnode                  43m 28s
  ✓ j11_jvm_dtests                    13m 11s
  ✓ j11_unit_tests                     8m 12s
  ✕ j11_cqlshlib_tests                 4m 53s  <-- some flakiness
  ✕ j11_dtests                        53m 35s  refresh_test.TestRefresh test_refresh_deadlock_startup
{noformat}

[java11_pre-commit_tests|https://app.circleci.com/pipelines/github/instaclustr/cassandra/5786/workflows/600abf52-20e7-4e67-b561-75ab07935fcb]
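For context, the gauge under discussion has to capture a consistent, referenced set of sstables before it can compute its value. Below is a minimal sketch of one way such a read can bound its retries and fall back to the last known value instead of spinning indefinitely (hypothetical names throughout; this is not the code from PR 4136):

{code:java}
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch only, not the PR's implementation. The supplier stands in
// for "select a view, try to reference its sstables, compute, release": it
// returns empty when referencing fails because a reader was already released.
public class BoundedRetryGauge
{
    private final Supplier<Optional<Long>> trySnapshot;
    private final int maxAttempts;
    private volatile long lastValue = 0; // served when every attempt fails

    public BoundedRetryGauge(Supplier<Optional<Long>> trySnapshot, int maxAttempts)
    {
        this.trySnapshot = trySnapshot;
        this.maxAttempts = maxAttempts;
    }

    public long getValue()
    {
        for (int i = 0; i < maxAttempts; i++)
        {
            Optional<Long> v = trySnapshot.get();
            if (v.isPresent())
            {
                lastValue = v.get();
                return lastValue;
            }
            // A released reader failed this attempt; the caller would re-select
            // a fresh view here and try again, but only maxAttempts times.
        }
        return lastValue; // bounded: a stale view can no longer pin the caller
    }
}
{code}

Whether the fallback is the last known value or a sentinel is a design choice; the point is only that the retry loop terminates.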
> Spinning trying to capture readers
> ----------------------------------
>
>                  Key: CASSANDRA-19776
>                  URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
>              Project: Apache Cassandra
>           Issue Type: Bug
>             Reporter: Cameron Zemek
>             Assignee: Stefan Miklosovic
>             Priority: Normal
>              Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>          Attachments: extract.log
>
>           Time Spent: 10m
>   Remaining Estimate: 0h
>
> On a handful of clusters we are noticing spin locks occurring. I traced back
> all the calls to the EstimatedPartitionCount metric (e.g.
> org.apache.cassandra.metrics:type=Table,keyspace=testks,scope=testcf,name=EstimatedPartitionCount)
> using the following patched function:
> {code:java}
> public RefViewFragment selectAndReference(Function<View, Iterable<SSTableReader>> filter)
> {
>     long failingSince = -1L;
>     boolean first = true;
>     while (true)
>     {
>         ViewFragment view = select(filter);
>         Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
>         if (refs != null)
>             return new RefViewFragment(view.sstables, view.memtables, refs);
>         if (failingSince <= 0)
>         {
>             failingSince = System.nanoTime();
>         }
>         else if (System.nanoTime() - failingSince > TimeUnit.MILLISECONDS.toNanos(100))
>         {
>             List<SSTableReader> released = new ArrayList<>();
>             for (SSTableReader reader : view.sstables)
>                 if (reader.selfRef().globalCount() == 0)
>                     released.add(reader);
>             NoSpamLogger.log(logger, NoSpamLogger.Level.WARN, 1, TimeUnit.SECONDS,
>                              "Spinning trying to capture readers {}, released: {}, ",
>                              view.sstables, released);
>             if (first)
>             {
>                 first = false;
>                 try
>                 {
>                     throw new RuntimeException("Spinning trying to capture readers");
>                 }
>                 catch (Exception e)
>                 {
>                     logger.warn("Spin lock stacktrace", e);
>                 }
>             }
>             failingSince = System.nanoTime();
>         }
>     }
> }
> {code}
> Digging into this code, I found it will fail if any of the sstables are in a
> released state (i.e. reader.selfRef().globalCount() == 0).
> See the attached extract.log for an example of one of these spin-lock
> occurrences. Sometimes these spin locks last over 5 minutes. On the cluster
> worst affected by this issue, I ran a log-processing script: every time the
> "Spinning trying to capture readers" message differed from the previous one,
> it output whether the released sstables were in a Compacting state. Every
> single occurrence shows it spin-locking with "released" listing an sstable
> that is compacting.
> In the extract.log example, it spin-locks saying that nb-320533-big-Data.db
> has been released, but you can see that prior to the spinning, that sstable
> is involved in a compaction. The compaction completes at 01:03:36 and the
> spinning stops. nb-320533-big-Data.db is deleted at 01:03:49 along with the
> other 9 sstables involved in the compaction.
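The failure mode described above is easy to reproduce in isolation. Here is a minimal, self-contained sketch (hypothetical types, not Cassandra's actual Ref/Refs classes) of why a single released reader keeps selectAndReference spinning: referencing the view is all-or-nothing, so a stale view that still lists a released sstable can never be captured:

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SpinDemo
{
    // Stand-in for an SSTableReader with a global reference count.
    static final class Reader
    {
        final String name;
        final AtomicInteger refs = new AtomicInteger(1); // 1 == live

        Reader(String name) { this.name = name; }

        // Take a reference only if the reader has not been fully released.
        boolean tryRef()
        {
            while (true)
            {
                int n = refs.get();
                if (n == 0)
                    return false; // already released, e.g. by a finished compaction
                if (refs.compareAndSet(n, n + 1))
                    return true;
            }
        }

        void release() { refs.decrementAndGet(); }
    }

    // All-or-nothing referencing, like Refs.tryRef(view.sstables): if any
    // single reader fails, undo the partial references and report failure.
    static boolean tryRefAll(List<Reader> view)
    {
        for (int i = 0; i < view.size(); i++)
        {
            if (!view.get(i).tryRef())
            {
                for (int j = 0; j < i; j++)
                    view.get(j).release();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        Reader live = new Reader("nb-1-big-Data.db");
        Reader compacted = new Reader("nb-320533-big-Data.db");
        compacted.release(); // last reference dropped, as after a compaction

        // A view that still lists the released reader can never be captured;
        // a caller looping on the same stale selection, as selectAndReference
        // does, spins until the view stops including that sstable.
        System.out.println(tryRefAll(Arrays.asList(live, compacted))); // false
    }
}
{code}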