[ https://issues.apache.org/jira/browse/CASSANDRA-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507844#comment-17507844 ]
James Brown edited comment on CASSANDRA-17172 at 3/16/22, 7:53 PM:
-------------------------------------------------------------------

Indeed; we've never dropped a single {{VALIDATION_REQ}} message, as far as I can tell, but half the responses get dropped. Is it a very large message? I assume it includes all the Merkle trees? We're not doing subrange repairs (just regular range repairs), so maybe we just have too much data in each range? Each node has 16 tokens and there are 12 nodes per datacenter, so there should be a total of 192 ranges?

was (Author: jbrownep):
Indeed; we've never dropped a single {{VALIDATION_REQ}} message, as far as I can tell, but half the responses get dropped. Is it a very large message? I assume it includes all the Merkle trees? We're not doing subrange repairs (just regular range repairs), so maybe we just have too much data in each range?

> incremental repairs get stuck often
> -----------------------------------
>
>             Key: CASSANDRA-17172
>             URL: https://issues.apache.org/jira/browse/CASSANDRA-17172
>         Project: Cassandra
>      Issue Type: Bug
>      Components: Consistency/Repair
>        Reporter: James Brown
>        Priority: Normal
>         Fix For: 4.0.x
>
>
> We're on 4.0.1 and switched to incremental repairs shortly after upgrading to 4.0.x. They work fine about 95% of the time, but once in a while a session will get stuck and will have to be cancelled (with `nodetool repair_admin cancel -s <uuid>`). Typically the session will be in REPAIRING but nothing will actually be happening.
>
> Output of nodetool repair_admin:
> {code}
> $ nodetool repair_admin
> id                                   | state     | last activity | coordinator                          | participants | participants_wp
> 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 | REPAIRING | 6771 (s)      | /[fd00:ea51:d057:200:1:0:0:8e]:25472 | fd00:ea51:d057:200:1:0:0:8e,fd00:ea51:d057:200:1:0:0:8f,fd00:ea51:d057:200:1:0:0:92,fd00:ea51:d057:100:1:0:0:571,fd00:ea51:d057:100:1:0:0:570,fd00:ea51:d057:200:1:0:0:93,fd00:ea51:d057:100:1:0:0:573,fd00:ea51:d057:200:1:0:0:90,fd00:ea51:d057:200:1:0:0:91,fd00:ea51:d057:100:1:0:0:572,fd00:ea51:d057:100:1:0:0:575,fd00:ea51:d057:100:1:0:0:574,fd00:ea51:d057:200:1:0:0:94,fd00:ea51:d057:100:1:0:0:577,fd00:ea51:d057:200:1:0:0:95,fd00:ea51:d057:100:1:0:0:576 | [fd00:ea51:d057:200:1:0:0:8e]:25472,[fd00:ea51:d057:200:1:0:0:8f]:25472,[fd00:ea51:d057:200:1:0:0:92]:25472,[fd00:ea51:d057:100:1:0:0:571]:25472,[fd00:ea51:d057:100:1:0:0:570]:25472,[fd00:ea51:d057:200:1:0:0:93]:25472,[fd00:ea51:d057:100:1:0:0:573]:25472,[fd00:ea51:d057:200:1:0:0:90]:25472,[fd00:ea51:d057:200:1:0:0:91]:25472,[fd00:ea51:d057:100:1:0:0:572]:25472,[fd00:ea51:d057:100:1:0:0:575]:25472,[fd00:ea51:d057:100:1:0:0:574]:25472,[fd00:ea51:d057:200:1:0:0:94]:25472,[fd00:ea51:d057:100:1:0:0:577]:25472,[fd00:ea51:d057:200:1:0:0:95]:25472,[fd00:ea51:d057:100:1:0:0:576]:25472
> {code}
> Running `jstack` on the coordinator shows two repair threads, both idle:
> {code}
> "Repair#167:1" #602177 daemon prio=5 os_prio=0 cpu=9.60ms elapsed=57359.81s tid=0x00007fa6d1741800 nid=0x18e6c waiting on condition [0x00007fc529f9a000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
>         - parking to wait for <0x000000045ba93a18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
>         at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
>
> "Repair#170:1" #654814 daemon prio=5 os_prio=0 cpu=9.62ms elapsed=7369.98s tid=0x00007fa6aec09000 nid=0x1a96f waiting on condition [0x00007fc535aae000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
>         - parking to wait for <0x00000004c45bf7d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:2123)
>         at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.11/LinkedBlockingQueue.java:458)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1053)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
> {code}
> nodetool netstats says there is nothing happening:
> {code}
> $ nodetool netstats | head -n 2
> Mode: NORMAL
> Not sending any streams.
> {code}
> There's nothing interesting in the logs for this repair; the last relevant thing was a bunch of "Created 0 sync tasks based on 6 merkle tree responses for 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 (took: 0ms)" and then back and forth for the last couple of hours with things like
> {code}
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO [OptionalTasks:1] LocalSessions.java:938 - Attempting to learn the outcome of unfinished local incremental repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO [AntiEntropyStage:1] LocalSessions.java:987 - Received StatusResponse for repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 with state REPAIRING, which is not actionable. Doing nothing.
> {code}
> Typically, cancelling the session and rerunning with the exact same command line will succeed.
> I haven't witnessed this behavior in our testing cluster; it seems to only happen on biggish keyspaces.
> I know there have historically been issues with this, which is why tools like Reaper kill any repair that takes more than a short period of time, but I also thought they were expected to have been fixed in 4.0.
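For reference, here is the back-of-the-envelope arithmetic from the comment above spelled out as a small sketch. The 16-token and 12-node figures come from this ticket; the assumption that each vnode token bounds one token range, and that a validation response carries the Merkle trees for the ranges it validated (as the comment speculates), is illustrative rather than taken from the Cassandra source:

{code}
// Hypothetical sketch of the range count discussed in the comment above.
// The figures are from this ticket; nothing here is Cassandra code.
public class RepairRangeMath {
    public static void main(String[] args) {
        int tokensPerNode = 16; // num_tokens for the nodes described here
        int nodesPerDc = 12;    // nodes per datacenter, per the comment

        // With vnodes, each token a node owns bounds one token range, so a
        // full (non-subrange) repair in one datacenter touches roughly:
        int rangesPerDc = tokensPerNode * nodesPerDc;
        System.out.println("Approximate ranges per DC: " + rangesPerDc); // 192

        // If the VALIDATION_REQ only names the ranges to validate while the
        // VALIDATION_RSP carries a Merkle tree per validated range (the
        // comment's assumption), the responses would be the far larger
        // messages, consistent with only the responses being dropped.
    }
}
{code}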