Hello!

Using Cassandra 2.2.11, I observe behaviour that is very similar to https://issues.apache.org/jira/browse/CASSANDRA-12860

Steps to reproduce:

1. Set up a cluster:
       ccm create five -v 2.2.11 && ccm populate -n 5 --vnodes && ccm start
2. Import some keyspace into it (approx. 50 MB of data).
3. Start repair on one node:
       ccm node2 nodetool repair KEYSPACE
4. While the repair is still running, disconnect node3:
       sudo iptables -I INPUT -p tcp -d 127.0.0.3 -j DROP
5. This repair hangs.
6. Restore network connectivity.
7. The repair is still hanging.
8. Subsequent repairs also hang.

In tpstats I see tasks that make no progress:

    $ for i in {1..5}; do echo node$i; ccm node$i nodetool tpstats | grep "Repair#"; done
    node1
    Repair#1                          1      2255              1         0                 0
    node2
    Repair#1                          1      2335             26         0                 0
    node3
    node4
    Repair#3                          1       147           2175         0                 0
    node5
    Repair#1                          1      2335             17         0                 0

In jconsole I see that the Repair threads are blocked here:

    Name: Repair#1:1
    State: WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@73c5ab7e
    Total blocked: 0  Total waited: 242

    Stack trace:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
    com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
    com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
    com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1371)
    org.apache.cassandra.repair.RepairJob.run(RepairJob.java:167)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)

According to the source code, they are waiting for validations to complete:

    # ./apache-cassandra-2.2.8-src/src/java/org/apache/cassandra/repair/RepairJob.java
     74     public void run()
     75     {
    ...
    166         // Wait for validation to complete
    167         Futures.getUnchecked(validations);

https://issues.apache.org/jira/browse/CASSANDRA-11824 says that the problem was fixed in 2.2.7, but I use 2.2.11.

Restarting all Cassandra nodes that have hanging tasks (one by one) makes these tasks disappear from tpstats. After that, repairs work well (until the next network problem).

I suppose that long GC pauses on one node during repair (like network issues) may lead to the same problem.

Is this a known issue?

--
Best Regards,
Dmitry Simonov
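
P.S. To illustrate the kind of wait I mean, here is a minimal sketch. It is not Cassandra code: it uses java.util.concurrent.CompletableFuture instead of Guava, and the class name ValidationWaitSketch and the method waitWithTimeout are hypothetical names I made up for the example. It only shows that a wait on a future nobody ever completes returns only if it is bounded by a timeout, whereas the call on RepairJob.java:167 in the stack trace above waits without one.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ValidationWaitSketch {

    // Hypothetical helper: waits for the result, but gives up after `millis`.
    // An unbounded get() on the same future would park the thread forever.
    static String waitWithTimeout(Future<String> validations, long millis) {
        try {
            return validations.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timed out";
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a validation response that never arrives
        // (e.g. the peer was unreachable when it should have replied).
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        System.out.println(waitWithTimeout(neverCompleted, 200));

        // Stand-in for a validation that did complete.
        CompletableFuture<String> completed =
                CompletableFuture.completedFuture("validated");
        System.out.println(waitWithTimeout(completed, 200));
    }
}
```

Run as written, this prints "timed out" and then "validated". I am not claiming that a timeout is the right fix for repair; it is only meant to make the symptom concrete.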