I would like to discuss (1). The problem is that reading a metric such as EstimatedPartitionCount while a compaction is in progress can sometimes spin until that compaction finishes.
The reason it spins (summarized in (2)) is that when compaction evaluates an SSTable as expired / to be dropped, that SSTable is not physically removed until the very end of the compaction. Its SSTable "tidier" is set, and it will eventually remove the files on disk once the transaction has finished and nobody references the SSTable any more. If EstimatedPartitionCount calls selectAndReference on such an SSTable in the meantime, it spins, because it waits for a reference that can no longer be obtained: the SSTable has already been "unreferenced", just not deleted. It is in a kind of limbo.

Branimir Lambov suggested that it is probably not a good idea to reference expired SSTables on CANONICAL (3).

My idea was to do this (4). isMarkedCompacted does

    public boolean isMarkedCompacted() { return tidy.global.obsoletion != null; }

which is not null when the SSTable is going to be removed from disk / nobody references it. So we would filter such SSTables out.

Jaydeepkumar Chovatia suggested that this approach might lead to "serious repercussions" (5), that we should not touch it, and that we should do this instead (6). However, that is not possible, because as Branimir mentioned: "The selectAndReference call in estimatedPartitionCount was added recently to fix a race that caused node failures when an sstable disappears while it's being processed."

It is worth noting that selectAndReference does not seem to be used consistently across the metrics. That also raises the question of whether we should approach this more holistically and cover all cases like this.

Do you also see (4) as risky? I built it for 4.0 and CI seems to pass (7), apart from one test that exercises this very CANONICAL functionality.

What are your takes here?

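To make the spin and the proposed filter concrete, here is a toy model. It is only a sketch and not the actual Cassandra code: ToySSTable, its reference counter, and the maxAttempts bound are made up for illustration, and the retry loop reflects my understanding that selectAndReference keeps retrying until every selected SSTable can be referenced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

public class SpinSketch {
    /** Hypothetical stand-in for an SSTable with a reference count (not SSTableReader). */
    static final class ToySSTable {
        final String name;
        final long estimatedKeys;
        private final AtomicInteger refs = new AtomicInteger(1); // 1 = the tracker's own reference
        private volatile boolean markedCompacted = false;

        ToySSTable(String name, long estimatedKeys) {
            this.name = name;
            this.estimatedKeys = estimatedKeys;
        }

        /** Fails once the count has reached zero: released, but files not yet deleted. */
        boolean tryRef() {
            while (true) {
                int n = refs.get();
                if (n <= 0) return false;
                if (refs.compareAndSet(n, n + 1)) return true;
            }
        }

        void release() { refs.decrementAndGet(); }

        /** Models compaction marking the table obsolete and dropping the last reference. */
        void markObsoleteAndRelease() { markedCompacted = true; release(); }

        boolean isMarkedCompacted() { return markedCompacted; }
    }

    /**
     * Retries until every selected table is referenced. The maxAttempts bound exists
     * only so this sketch terminates (assumption: the real loop has no such bound).
     * Returns null if it gave up.
     */
    static List<ToySSTable> selectAndReference(List<ToySSTable> view,
                                               Predicate<ToySSTable> filter,
                                               int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            List<ToySSTable> acquired = new ArrayList<>();
            boolean ok = true;
            for (ToySSTable s : view) {
                if (!filter.test(s)) continue;
                if (s.tryRef()) acquired.add(s);
                else { ok = false; break; }
            }
            if (ok) return acquired;
            for (ToySSTable s : acquired) s.release(); // roll back and retry
        }
        return null;
    }

    static long estimatedPartitionCount(List<ToySSTable> referenced) {
        long sum = 0;
        for (ToySSTable s : referenced) sum += s.estimatedKeys;
        return sum;
    }

    public static void main(String[] args) {
        ToySSTable live = new ToySSTable("live", 100);
        ToySSTable expired = new ToySSTable("expired", 40);
        expired.markObsoleteAndRelease(); // in limbo: unreferenced, not yet deleted
        List<ToySSTable> view = List.of(live, expired);

        // Unfiltered: the expired table can never be referenced, so every attempt fails.
        System.out.println(selectAndReference(view, s -> true, 1000) == null); // true

        // Filtered as in (4): the obsolete table is skipped and the call returns at once.
        List<ToySSTable> refs = selectAndReference(view, s -> !s.isMarkedCompacted(), 1000);
        System.out.println(estimatedPartitionCount(refs)); // 100
    }
}
```

In this model the unfiltered call can only succeed once the expired table leaves the view, which corresponds to the compaction finishing; the filtered call never touches it.
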
Regards

(1) https://issues.apache.org/jira/browse/CASSANDRA-19776
(2) https://issues.apache.org/jira/browse/CASSANDRA-19776?focusedCommentId=17950873&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17950873
(3) https://issues.apache.org/jira/browse/CASSANDRA-19776?focusedCommentId=17950979&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17950979
(4) https://github.com/apache/cassandra/pull/4156/files#diff-92c8e689de9c33eb580a18eef6d7db02d1fb089183c32c8c8d99344d0964326c
(5) https://issues.apache.org/jira/browse/CASSANDRA-19776?focusedCommentId=17952394&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17952394
(6) https://issues.apache.org/jira/browse/CASSANDRA-19776?focusedCommentId=17952747&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17952747
(7) https://app.circleci.com/pipelines/github/instaclustr/cassandra/5803/workflows/0935b05f-e246-463f-95fc-6dcc3822d611