[ https://issues.apache.org/jira/browse/CASSANDRA-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507830#comment-17507830 ]
James Brown commented on CASSANDRA-17172:
-----------------------------------------

This is now happening on about 50% of all repairs. The cluster is not heavily loaded and isn't dropping mutation messages regularly: on the machine I'm looking at, {{nodetool tpstats}} says it has dropped 206 MUTATION_RSP messages (out of 6911491 MutationStage tasks, so 0.003%) and 32 VALIDATION_RSP messages (out of 120, I believe, so 27%). We're using G1GC and the longest GC pause we've seen is 30ms.

{code}
VALIDATION_RSP messages were dropped in last 5000 ms: 0 internal and 9 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 21356 ms
{code}

This particular cluster has 24 machines (12 each in two datacenters) with a few TB of data total. Can you think of anything else?

> incremental repairs get stuck often
> -----------------------------------
>
>                 Key: CASSANDRA-17172
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17172
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: James Brown
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> We're on 4.0.1 and switched to incremental repairs shortly after upgrading to 4.0.x. They work fine about 95% of the time, but once in a while a session will get stuck and has to be cancelled (with `nodetool repair_admin cancel -s <uuid>`). Typically the session will be in REPAIRING but nothing will actually be happening.
>
> Output of nodetool repair_admin:
> {code}
> $ nodetool repair_admin
> id                                   | state     | last activity | coordinator                          | participants | participants_wp
> 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 | REPAIRING | 6771 (s)      | /[fd00:ea51:d057:200:1:0:0:8e]:25472 | fd00:ea51:d057:200:1:0:0:8e,fd00:ea51:d057:200:1:0:0:8f,fd00:ea51:d057:200:1:0:0:92,fd00:ea51:d057:100:1:0:0:571,fd00:ea51:d057:100:1:0:0:570,fd00:ea51:d057:200:1:0:0:93,fd00:ea51:d057:100:1:0:0:573,fd00:ea51:d057:200:1:0:0:90,fd00:ea51:d057:200:1:0:0:91,fd00:ea51:d057:100:1:0:0:572,fd00:ea51:d057:100:1:0:0:575,fd00:ea51:d057:100:1:0:0:574,fd00:ea51:d057:200:1:0:0:94,fd00:ea51:d057:100:1:0:0:577,fd00:ea51:d057:200:1:0:0:95,fd00:ea51:d057:100:1:0:0:576 | [fd00:ea51:d057:200:1:0:0:8e]:25472,[fd00:ea51:d057:200:1:0:0:8f]:25472,[fd00:ea51:d057:200:1:0:0:92]:25472,[fd00:ea51:d057:100:1:0:0:571]:25472,[fd00:ea51:d057:100:1:0:0:570]:25472,[fd00:ea51:d057:200:1:0:0:93]:25472,[fd00:ea51:d057:100:1:0:0:573]:25472,[fd00:ea51:d057:200:1:0:0:90]:25472,[fd00:ea51:d057:200:1:0:0:91]:25472,[fd00:ea51:d057:100:1:0:0:572]:25472,[fd00:ea51:d057:100:1:0:0:575]:25472,[fd00:ea51:d057:100:1:0:0:574]:25472,[fd00:ea51:d057:200:1:0:0:94]:25472,[fd00:ea51:d057:100:1:0:0:577]:25472,[fd00:ea51:d057:200:1:0:0:95]:25472,[fd00:ea51:d057:100:1:0:0:576]:25472
> {code}
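> As an aside, a rough sketch of how spotting and cancelling these stuck sessions could be scripted from the output above (the pipe-separated column layout is simply what `nodetool repair_admin` prints here and may differ in other versions; the one-hour threshold is arbitrary):
> {code}
> #!/usr/bin/env bash
> # Sketch: find incremental repair sessions that have sat in REPAIRING for
> # longer than the threshold and cancel them. Assumes the columns shown above:
> # id | state | last activity | coordinator | ...
> THRESHOLD_SECS=3600
>
> nodetool repair_admin \
>   | awk -F'|' -v threshold="$THRESHOLD_SECS" 'NR > 1 {
>       gsub(/ /, "", $1); gsub(/ /, "", $2)   # strip padding from id and state
>       split($3, age, " ")                    # "6771 (s)" -> 6771
>       if ($2 == "REPAIRING" && age[1] + 0 > threshold + 0) print $1
>     }' \
>   | while read -r session; do
>       echo "cancelling stuck repair session $session"
>       nodetool repair_admin cancel -s "$session"
>     done
> {code}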
> Running `jstack` on the coordinator shows two repair threads, both idle:
> {code}
> "Repair#167:1" #602177 daemon prio=5 os_prio=0 cpu=9.60ms elapsed=57359.81s tid=0x00007fa6d1741800 nid=0x18e6c waiting on condition [0x00007fc529f9a000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
>         - parking to wait for <0x000000045ba93a18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
>         at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
>
> "Repair#170:1" #654814 daemon prio=5 os_prio=0 cpu=9.62ms elapsed=7369.98s tid=0x00007fa6aec09000 nid=0x1a96f waiting on condition [0x00007fc535aae000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
>         - parking to wait for <0x00000004c45bf7d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
>         at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
> {code}
> nodetool netstats says there is nothing happening:
> {code}
> $ nodetool netstats | head -n 2
> Mode: NORMAL
> Not sending any streams.
> {code}
> There's nothing interesting in the logs for this repair; the last relevant thing was a bunch of "Created 0 sync tasks based on 6 merkle tree responses for 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 (took: 0ms)" and then, back and forth for the last couple of hours, things like
> {code}
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO [OptionalTasks:1] LocalSessions.java:938 - Attempting to learn the outcome of unfinished local incremental repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO [AntiEntropyStage:1] LocalSessions.java:987 - Received StatusResponse for repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 with state REPAIRING, which is not actionable. Doing nothing.
> {code}
> Typically, cancelling the session and rerunning with the exact same command line will succeed.
> I haven't witnessed this behavior in our testing cluster; it seems to only happen on biggish keyspaces.
> I know there have historically been issues with this, which is why tools like Reaper kill any repair that takes more than a short period of time, but I also thought they were expected to have been fixed with 4.0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org