Hi Daniel,

We faced a similar issue during repair with Reaper. We ran repair with more repair threads than the number of Cassandra nodes, but every now and then the repair would get stuck and we had to do a rolling restart of the cluster or wait for the lock to expire (~1 hr).
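To illustrate why extra repair threads just pile up rather than speed things up: repair message handling on each node effectively serializes on one coarse lock (see the thread dump further down), so at most one repair session per node makes progress at a time. The snippet below is only an illustrative sketch in plain Java, not actual Cassandra or Reaper code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustration only: a coarse per-node lock (like the monitor on
// ActiveRepairService) means extra repair segments sent to the same node
// just queue up behind it instead of running concurrently.
public class RepairLockSketch {

    static class NodeRepairService {
        // One monitor guards all repair-session bookkeeping for the node.
        synchronized void runRepairSession(String segment) throws InterruptedException {
            System.out.println(Thread.currentThread().getName() + " repairing " + segment);
            Thread.sleep(500); // stand-in for validation / anti-compaction work
        }
    }

    public static void main(String[] args) throws Exception {
        NodeRepairService node = new NodeRepairService();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Four segments for the same node: only one runs at a time, the other
        // three threads sit BLOCKED waiting for the monitor.
        for (int i = 0; i < 4; i++) {
            String segment = "segment-" + i;
            pool.submit(() -> {
                node.runRepairSession(segment);
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}

Running it prints one "repairing" line at a time while the remaining pool threads stay BLOCKED on the monitor, which matches the state we saw on AntiEntropyStage.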
We had a look at the stuck repair: thread pools were getting stuck at the AntiEntropy stage. From the synchronized block in the repair code it appeared that at most one concurrent repair session per node is possible.

According to https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
The segment runner has a protection mechanism to avoid overloading nodes, using two simple rules to postpone repair if:
1. Number of pending compactions is greater than MAX_PENDING_COMPACTIONS (20 by default)
2. Node is already running a repair job

We tried running Reaper with fewer repair threads than the number of nodes (assuming Reaper would not submit multiple segments to a single Cassandra node), but we still observed multiple repair segments going to the same node concurrently, so nodes could still get stuck in that state. Finally we settled on a single repair thread in the Reaper settings. Although it takes slightly more time, it has completed successfully numerous times.

Thread dump of the Cassandra server when repair was getting stuck:

"AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x00007f0fa16226a0 nid=0x3c82 waiting for monitor entry [0x00007ee9eabaf000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
        - waiting to lock <0x000000067c083308> (a org.apache.cassandra.service.ActiveRepairService)
        at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

Hope it helps!

Regards,
Bhuvan

On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

> Hi Daniel,
>
> could you file a bug in the issue tracker? https://github.com/thelastpickle/cassandra-reaper/issues
>
> We'll figure out what's wrong and get your repairs running.
>
> Thanks!
>
> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <dan...@kleviansky.com> wrote:
>
>> Hi everyone,
>>
>> Using The Last Pickle's fork of Reaper, and unfortunately running into a bit of an issue. I'll try to break it down below.
>>
>> # Problem Description:
>> * After starting repair via the GUI, progress remains at 0/x.
>> * Cassandra nodes calculate their respective token ranges, and then nothing happens.
>> * There were no errors in the Reaper or Cassandra logs, only a message of acknowledgement that a repair had initiated.
>> * Performing a stack trace on the running JVM, one can see that the thread spawning the repair process was waiting on a lock that was never being released.
>> * This occurred on all nodes, and prevented any manually initiated repair process from running. A rolling restart of each node was required, after which one could run a `nodetool repair` successfully.
>>
>> # Cassandra Cluster Details:
>> * Cassandra 2.2.5 running on Windows Server 2008 R2
>> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>>
>> # Reaper Details:
>> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL database.
>>
>> ## Reaper settings:
>> * Parallelism: DC-Aware
>> * Repair Intensity: 0.9
>> * Incremental: true
>>
>> Don't want to swamp you with more details or unnecessary logs, especially as I'd have to sanitize them before sending them out, so please let me know if there is anything else I can provide, and I'll do my best to get it to you.
>>
>> Kind regards,
>> Daniel
>>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com