[ https://issues.apache.org/jira/browse/CASSANDRA-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17933625#comment-17933625 ]
Dmitry Konstantinov commented on CASSANDRA-19101: ------------------------------------------------- Hi, I have got this test failed recently ([https://ci-cassandra.apache.org/job/Cassandra-trunk/2057/#showFailuresLink] ), so I decided to try analyzed it deeper. I think the cause why the test is failing sometimes: additionally to the writing to CommitLogTest.Standard1 table we implicitly write to system tables such as system.local, system_schema.keyspaces, sstable_activity, etc. These writes also are captured in commit log and we do not force flushing of them to release the correspondent commit log segments, so depends on timing we may have additional commit log segments are hold by these dirty tables. [^Cassandra-trunk_2057_CommitlogShutdownTest.log] - in the CI log we can see that there is some parallel background activity while we execute our test logic in "main" thread. For example, after the discard check in main thread: {code:java} DEBUG [main] 2025-03-07 21:55:22,733 AbstractCommitLogSegmentManager.java:412 - Segment CommitLogSegment(build/test/cassandra/commitlog/CommitLog-9-1741384520601.log) is no longer active and will be deleted now {code} we see in the log a completion of system.local table flushing (which means that at that moment commit log data for system.local cannot be discarded yet. {code:java} INFO [PerDiskMemtableFlushWriter_0:2] 2025-03-07 21:55:22,736 Flushing.java:153 - Writing Memtable-local@1560781103(89B serialized bytes, 2 ops, 672B (0%) on-heap, 0B (0%) off-heap), flushed range = [null, null) INFO [PerDiskMemtableFlushWriter_0:2] 2025-03-07 21:55:22,736 Flushing.java:179 - Completed flushing /home/cassandra/cassandra/build/test/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/oa-5-big-Data.db (37B) for commitlog position CommitLogPosition(segmentId=1741384520600, position=786787) {code} Also, I am lucky to be able to reproduce the issue quite frequently on my laptop and I have changed temporary logging in org.apache.cassandra.db.commitlog.CommitLog#discardCompletedSegments from tracing to debug level, so I can see what prevents commit log segments from discarding: {code:java} DEBUG [main] 2025-03-08 11:49:11,145 CommitLog.java:375 - Not safe to delete commit log segment CommitLogSegment(build/test/cassandra/commitlog/CommitLog-9-1741423746535.log); dirty is keyspaces (abac5682-dea6-31c5-b535-b3d6cffd0fb6, dirty: [89,15547], clean: null), columns (24101c25-a2ae-3af7-87c1-b40ee1aca33f, dirty: [89,15547], clean: null), tables (afddfb9d-bc1e-3068-8056-eed6c302ba09, dirty: [89,15547], clean: null), column_masks (738cc5ed-0168-3268-b9d1-853d4bc278af, dirty: [15547,15547], clean: null), DEBUG [main] 2025-03-08 11:49:11,146 CommitLog.java:375 - Not safe to delete commit log segment CommitLogSegment(build/test/cassandra/commitlog/CommitLog-9-1741423746536.log); dirty is local (7ad54392-bcdd-35a6-8417-4e047860b377, dirty: [786713,786713], clean: null), {code} So, it is an additional proof for the hypothesis which I listed before. I see the following possible ways to deal with the issue: # Try to flush all system tables + to disable their modifications by disabling background activities. While it looks possible I think it is a fragile approach: we need to find all such background activities/tables to make the test reliable + if later some feature introduces a new activity/table our test will become unstable again. So, I would prefer to not go in this way. # Change the test: instead of attempts to synthetically emulate writes to commit log for a table to check if segments are discarded after it we can test that the whole drain operation does the discarding (this was the original reason to introduce this test: CASSANDRA-13123). It may reduce the probability of catching concurrency bugs because drain operation is more complex and have other steps but it is more reliable way to test it as a black box and do not make assumptions about the system. # Just delete the test. It was introduced a long time ago in CASSANDRA-13123 to check async discarding logic which we had in 3.0 but now org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager#discard method is not submitting any tasks to a separate "segment management thread" and does the discarding logic by itself (this change happened in CASSANDRA-8844: !diff_discard_logic.png|width=500! So, the original failure scenario the test tries to check does not exist anymore. [~smiklosovic] Regarding your confusion about the injection. The original async logic mentioned here is the reason why the sleep was injected by the test: it is used to increase time to process tasks by this management thread and in this way to increase the probability to get the state then CommitLog.instance.shutdownBlocking() is invoked we have not process yet tasks within the queue and the test can assert shutdown logic waits for their completion instead of just dropping/ignore them. [~smiklosovic] [~jmckenzie] do you have a preference for the listed options? (I would prefer to go with 2 or 3) > Test Failure: org.apache.cassandra.db.commitlog.CommitlogShutdownTest failed > on trunk > ------------------------------------------------------------------------------------- > > Key: CASSANDRA-19101 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19101 > Project: Apache Cassandra > Issue Type: Bug > Components: Test/unit > Reporter: Jacek Lewandowski > Assignee: Dmitry Konstantinov > Priority: Normal > Fix For: 5.x > > Attachments: Cassandra-trunk_2057_CommitlogShutdownTest.log, > diff_discard_logic.png > > > {noformat} > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.cassandra.db.commitlog.CommitlogShutdownTest.testShutdownWithPendingTasks(CommitlogShutdownTest.java:96) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.jboss.byteman.contrib.bmunit.BMUnitRunner$10.evaluate(BMUnitRunner.java:393) > at > org.jboss.byteman.contrib.bmunit.BMUnitRunner$6.evaluate(BMUnitRunner.java:263) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at > org.jboss.byteman.contrib.bmunit.BMUnitRunner$1.evaluate(BMUnitRunner.java:97) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > at > com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) > at > com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38) > at > com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11) > at > com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35) > at > com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:232) > at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:55) > {noformat} > Manual testing to confirm issues found by CircleCI when testing > CASSANDRA-18464. Run with Java 11 / IntelliJ -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org