[ 
https://issues.apache.org/jira/browse/CASSANDRA-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17933625#comment-17933625
 ] 

Dmitry Konstantinov commented on CASSANDRA-19101:
-------------------------------------------------

Hi, I have got this test failed recently 
([https://ci-cassandra.apache.org/job/Cassandra-trunk/2057/#showFailuresLink] 
), so I decided to try analyzed it deeper.
I think the cause why the test is failing sometimes: additionally to the 
writing to CommitLogTest.Standard1 table we implicitly write to system tables 
such as system.local, system_schema.keyspaces, sstable_activity, etc. These 
writes also are captured in commit log and we do not force flushing of them to 
release the correspondent commit log segments, so depends on timing we may have 
additional commit log segments are hold by these dirty tables.

[^Cassandra-trunk_2057_CommitlogShutdownTest.log] - in the CI log we can see 
that there is some parallel background activity while we execute our test logic 
in "main" thread. For example, after the discard check in main thread:
{code:java}
DEBUG [main] 2025-03-07 21:55:22,733 AbstractCommitLogSegmentManager.java:412 - 
Segment 
CommitLogSegment(build/test/cassandra/commitlog/CommitLog-9-1741384520601.log) 
is no longer active and will be deleted now
{code}
we see in the log a completion of system.local table flushing (which means that 
at that moment commit log data for system.local cannot be discarded yet.
{code:java}
INFO  [PerDiskMemtableFlushWriter_0:2] 2025-03-07 21:55:22,736 
Flushing.java:153 - Writing Memtable-local@1560781103(89B serialized bytes, 2 
ops, 672B (0%) on-heap, 0B (0%) off-heap), flushed range = [null, null)

INFO  [PerDiskMemtableFlushWriter_0:2] 2025-03-07 21:55:22,736 
Flushing.java:179 - Completed flushing 
/home/cassandra/cassandra/build/test/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/oa-5-big-Data.db
 (37B) for commitlog position CommitLogPosition(segmentId=1741384520600, 
position=786787) {code}
Also, I am lucky to be able to reproduce the issue quite frequently on my 
laptop and I have changed temporary logging in 
org.apache.cassandra.db.commitlog.CommitLog#discardCompletedSegments from 
tracing to debug level, so I can see what prevents commit log segments from 
discarding:
{code:java}
DEBUG [main] 2025-03-08 11:49:11,145 CommitLog.java:375 - Not safe to delete 
commit log segment 
CommitLogSegment(build/test/cassandra/commitlog/CommitLog-9-1741423746535.log); 
dirty is keyspaces (abac5682-dea6-31c5-b535-b3d6cffd0fb6, dirty: [89,15547], 
clean: null), columns (24101c25-a2ae-3af7-87c1-b40ee1aca33f, dirty: [89,15547], 
clean: null), tables (afddfb9d-bc1e-3068-8056-eed6c302ba09, dirty: [89,15547], 
clean: null), column_masks (738cc5ed-0168-3268-b9d1-853d4bc278af, dirty: 
[15547,15547], clean: null), 
DEBUG [main] 2025-03-08 11:49:11,146 CommitLog.java:375 - Not safe to delete 
commit log segment 
CommitLogSegment(build/test/cassandra/commitlog/CommitLog-9-1741423746536.log); 
dirty is local (7ad54392-bcdd-35a6-8417-4e047860b377, dirty: [786713,786713], 
clean: null), 
{code}
So, it is an additional proof for the hypothesis which I listed before.

I see the following possible ways to deal with the issue:
 # Try to flush all system tables + to disable their modifications by disabling 
background activities. While it looks possible  I think it is a fragile 
approach: we need to find all such background activities/tables to make the 
test reliable + if later some feature introduces a new activity/table our test 
will become unstable again. So, I would prefer to not go in this way.
 # Change the test: instead of attempts to synthetically emulate writes to 
commit log for a table to check if segments are discarded after it we can test 
that the whole drain operation does the discarding (this was the original 
reason to introduce this test: CASSANDRA-13123). It may reduce the probability 
of catching concurrency bugs because drain operation is more complex and have 
other steps but it is more reliable way to test it as a black box and do not 
make assumptions about the system.
 # Just delete the test. It was introduced a long time ago in CASSANDRA-13123 
to check async discarding logic which we had in 3.0 but now 
org.apache.cassandra.db.commitlog.AbstractCommitLogSegmentManager#discard 
method is not submitting any tasks to a separate "segment management thread" 
and does the discarding logic by itself (this change happened in CASSANDRA-8844:
!diff_discard_logic.png|width=500!
So, the original failure scenario the test tries to check does not exist 
anymore.
[~smiklosovic] Regarding your confusion about the injection. The original async 
logic mentioned here is the reason why the sleep was injected by the test: it 
is used to increase time to process tasks by this management thread and in this 
way to increase the probability to get the state then 
CommitLog.instance.shutdownBlocking() is invoked we have not process yet tasks 
within the queue and the test can assert shutdown logic waits for their 
completion instead of just dropping/ignore them.

[~smiklosovic]  [~jmckenzie] do you have a preference for the listed options? 
(I would prefer to go with 2 or 3)

> Test Failure: org.apache.cassandra.db.commitlog.CommitlogShutdownTest failed 
> on trunk
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19101
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19101
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: Jacek Lewandowski
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: Cassandra-trunk_2057_CommitlogShutdownTest.log, 
> diff_discard_logic.png
>
>
> {noformat}
> java.lang.AssertionError
>       at org.junit.Assert.fail(Assert.java:86)
>       at org.junit.Assert.assertTrue(Assert.java:41)
>       at org.junit.Assert.assertTrue(Assert.java:52)
>       at 
> org.apache.cassandra.db.commitlog.CommitlogShutdownTest.testShutdownWithPendingTasks(CommitlogShutdownTest.java:96)
>       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.jboss.byteman.contrib.bmunit.BMUnitRunner$10.evaluate(BMUnitRunner.java:393)
>       at 
> org.jboss.byteman.contrib.bmunit.BMUnitRunner$6.evaluate(BMUnitRunner.java:263)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at 
> org.jboss.byteman.contrib.bmunit.BMUnitRunner$1.evaluate(BMUnitRunner.java:97)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>       at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
>       at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
>       at 
> com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
>       at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
>       at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:232)
>       at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:55)
> {noformat}
> Manual testing to confirm issues found by CircleCI when testing 
> CASSANDRA-18464. Run with Java 11 / IntelliJ



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to