[ https://issues.apache.org/jira/browse/CASSANDRA-19776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949645#comment-17949645 ]
Stefan Miklosovic edited comment on CASSANDRA-19776 at 5/6/25 7:31 AM:
-----------------------------------------------------------------------

[~marcuse] Interesting! I have done that here: https://github.com/apache/cassandra/pull/4136

Looks good, but I wonder why we went with "actuals" only before. Is there any deeper logic behind that?

[CASSANDRA-19776-4.0-second-approach|https://github.com/instaclustr/cassandra/tree/CASSANDRA-19776-4.0-second-approach]

{noformat}
java11_pre-commit_tests
  ✓ j11_build                          1m 36s
  ✓ j11_cqlsh-dtests-py2-no-vnodes     5m 22s
  ✓ j11_cqlsh-dtests-py2-with-vnodes   6m 6s
  ✓ j11_cqlsh_dtests_py3               5m 32s
  ✓ j11_cqlsh_dtests_py311             5m 42s
  ✓ j11_cqlsh_dtests_py311_vnode       5m 45s
  ✓ j11_cqlsh_dtests_py38              5m 41s
  ✓ j11_cqlsh_dtests_py38_vnode        5m 51s
  ✓ j11_cqlsh_dtests_py3_vnode         6m 34s
  ✓ j11_dtests_vnode                  43m 28s
  ✓ j11_jvm_dtests                    13m 11s
  ✓ j11_unit_tests                     8m 12s
  ✕ j11_cqlshlib_tests                 4m 53s  <-- some flakiness
  ✕ j11_dtests                        53m 35s  refresh_test.TestRefresh test_refresh_deadlock_startup
{noformat}

[java11_pre-commit_tests|https://app.circleci.com/pipelines/github/instaclustr/cassandra/5786/workflows/600abf52-20e7-4e67-b561-75ab07935fcb]
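For context, the gauge under discussion has to capture a consistent, referenced set of sstables before it can compute its value. Below is a minimal sketch of one way such a read can bound its retries and fall back to the last known value instead of spinning indefinitely (hypothetical names throughout; this is not the code from PR 4136):

{code:java}
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch only, not the PR's implementation. The supplier stands in
// for "select a view, try to reference its sstables, compute, release": it
// returns empty when referencing fails because a reader was already released.
public class BoundedRetryGauge
{
    private final Supplier<Optional<Long>> trySnapshot;
    private final int maxAttempts;
    private volatile long lastValue = 0; // served when every attempt fails

    public BoundedRetryGauge(Supplier<Optional<Long>> trySnapshot, int maxAttempts)
    {
        this.trySnapshot = trySnapshot;
        this.maxAttempts = maxAttempts;
    }

    public long getValue()
    {
        for (int i = 0; i < maxAttempts; i++)
        {
            Optional<Long> v = trySnapshot.get();
            if (v.isPresent())
            {
                lastValue = v.get();
                return lastValue;
            }
            // A released reader failed this attempt; the caller would re-select
            // a fresh view here and try again, but only maxAttempts times.
        }
        return lastValue; // bounded: a stale view can no longer pin the caller
    }
}
{code}

Whether the fallback is the last known value or a sentinel is a design choice; the point is only that the retry loop terminates.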
> Spinning trying to capture readers
> ----------------------------------
>
>                  Key: CASSANDRA-19776
>                  URL: https://issues.apache.org/jira/browse/CASSANDRA-19776
>              Project: Apache Cassandra
>           Issue Type: Bug
>             Reporter: Cameron Zemek
>             Assignee: Stefan Miklosovic
>             Priority: Normal
>              Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>          Attachments: extract.log
>
>           Time Spent: 10m
>   Remaining Estimate: 0h
>
> On a handful of clusters we are noticing spin locks occurring. I traced back
> all the calls to the EstimatedPartitionCount metric (e.g.
> org.apache.cassandra.metrics:type=Table,keyspace=testks,scope=testcf,name=EstimatedPartitionCount)
> using the following patched function:
> {code:java}
> public RefViewFragment selectAndReference(Function<View, Iterable<SSTableReader>> filter)
> {
>     long failingSince = -1L;
>     boolean first = true;
>     while (true)
>     {
>         ViewFragment view = select(filter);
>         Refs<SSTableReader> refs = Refs.tryRef(view.sstables);
>         if (refs != null)
>             return new RefViewFragment(view.sstables, view.memtables, refs);
>         if (failingSince <= 0)
>         {
>             failingSince = System.nanoTime();
>         }
>         else if (System.nanoTime() - failingSince > TimeUnit.MILLISECONDS.toNanos(100))
>         {
>             List<SSTableReader> released = new ArrayList<>();
>             for (SSTableReader reader : view.sstables)
>                 if (reader.selfRef().globalCount() == 0)
>                     released.add(reader);
>             NoSpamLogger.log(logger, NoSpamLogger.Level.WARN, 1, TimeUnit.SECONDS,
>                              "Spinning trying to capture readers {}, released: {}, ",
>                              view.sstables, released);
>             if (first)
>             {
>                 first = false;
>                 try
>                 {
>                     throw new RuntimeException("Spinning trying to capture readers");
>                 }
>                 catch (Exception e)
>                 {
>                     logger.warn("Spin lock stacktrace", e);
>                 }
>             }
>             failingSince = System.nanoTime();
>         }
>     }
> }
> {code}
> Digging into this code, I found it will fail if any of the sstables are in a
> released state (i.e. reader.selfRef().globalCount() == 0).
> See the attached extract.log for an example of one of these spin-lock
> occurrences. Sometimes these spin locks last over 5 minutes. On the cluster
> worst affected by this issue, I ran a log-processing script: every time the
> "Spinning trying to capture readers" message differed from the previous one,
> it output whether the released sstables were in a Compacting state. Every
> single occurrence shows it spin-locking with "released" listing an sstable
> that is compacting.
> In the extract.log example, it spin-locks saying that nb-320533-big-Data.db
> has been released, but you can see that prior to the spinning, that sstable
> is involved in a compaction. The compaction completes at 01:03:36 and the
> spinning stops. nb-320533-big-Data.db is deleted at 01:03:49 along with the
> other 9 sstables involved in the compaction.
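The failure mode described above is easy to reproduce in isolation. Here is a minimal, self-contained sketch (hypothetical types, not Cassandra's actual Ref/Refs classes) of why a single released reader keeps selectAndReference spinning: referencing the view is all-or-nothing, so a stale view that still lists a released sstable can never be captured:

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SpinDemo
{
    // Stand-in for an SSTableReader with a global reference count.
    static final class Reader
    {
        final String name;
        final AtomicInteger refs = new AtomicInteger(1); // 1 == live

        Reader(String name) { this.name = name; }

        // Take a reference only if the reader has not been fully released.
        boolean tryRef()
        {
            while (true)
            {
                int n = refs.get();
                if (n == 0)
                    return false; // already released, e.g. by a finished compaction
                if (refs.compareAndSet(n, n + 1))
                    return true;
            }
        }

        void release() { refs.decrementAndGet(); }
    }

    // All-or-nothing referencing, like Refs.tryRef(view.sstables): if any
    // single reader fails, undo the partial references and report failure.
    static boolean tryRefAll(List<Reader> view)
    {
        for (int i = 0; i < view.size(); i++)
        {
            if (!view.get(i).tryRef())
            {
                for (int j = 0; j < i; j++)
                    view.get(j).release();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        Reader live = new Reader("nb-1-big-Data.db");
        Reader compacted = new Reader("nb-320533-big-Data.db");
        compacted.release(); // last reference dropped, as after a compaction

        // A view that still lists the released reader can never be captured;
        // a caller looping on the same stale selection, as selectAndReference
        // does, spins until the view stops including that sstable.
        System.out.println(tryRefAll(Arrays.asList(live, compacted))); // false
    }
}
{code}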