I relaunched my cluster from scratch (for another reason). After the relaunch I could run nodetool repair -par -inc -pr on the nodes without issue, but pretty much the moment I started pushing production load to the cluster I ran into the same problem again. I opened a ticket first for adding logging info, but I'll most probably end up adding the logging myself and then start digging into the actual root cause.
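Before digging into the source, one quick sanity check (along the lines of Rahul's question below about reachability) is to verify from the repair coordinator that every node accepts connections on the storage port. This is a minimal sketch of my own, not anything shipped with nodetool; the host names and the default port 7000 are assumptions:

```python
import socket

def check_endpoints(hosts, port=7000, timeout=2.0):
    """Return the subset of hosts that do not accept a TCP connection
    on the given port (7000 is Cassandra's default storage port)."""
    unreachable = []
    for host in hosts:
        try:
            # Successful connect means the endpoint is at least reachable.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            unreachable.append(host)
    return unreachable

# Example (placeholder host names):
# down = check_endpoints(["db08-1", "db08-2", "db08-3"])
```

This only proves TCP reachability, of course, so a node that is up but slow to answer the prepare message would still pass the check.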
I also ran one nodetool repair -par (i.e. without incremental repair) and it seems that the repair started. I guess I need to go over the sources to see if there's a different code path that would explain this. I can't yet call this conclusive, but it seems that I can't run incremental repairs on the current 2.1.1, and I'm still wondering if anybody else is experiencing the same problem.

On Thu, Oct 30, 2014 at 1:14 PM, Juho Mäkinen <juho.maki...@gmail.com> wrote:
> No, the cluster seems to be performing just fine. It seems that the
> prepareForRepair callback() could be easily modified to print which node(s)
> are unable to respond, so that the debugging effort could be focused
> better. This of course doesn't help this case, as it's not trivial to add
> the log lines and roll them out to the entire cluster.
>
> The cluster is relatively young, containing only 450GB with RF=3 spread
> over nine nodes, and I'm still practicing how to run incremental repairs on
> the cluster when I stumbled on this issue.
>
> On Thu, Oct 30, 2014 at 12:52 PM, Rahul Neelakantan <ra...@rahul.be>
> wrote:
>
>> It appears to come from the ActiveRepairService.prepareForRepair portion
>> of the code.
>>
>> Are you sure all nodes are reachable from the node you are initiating
>> repair on, at the same time?
>>
>> Any node up/down/died messages?
>>
>> Rahul Neelakantan
>>
>> > On Oct 30, 2014, at 6:37 AM, Juho Mäkinen <juho.maki...@gmail.com>
>> wrote:
>> >
>> > I'm having problems running nodetool repair -inc -par -pr on my 2.1.1
>> cluster due to a "Did not get positive replies from all endpoints" error.
>> >
>> > Here's an example output:
>> > root@db08-3:~# nodetool repair -par -inc -pr
>> > [2014-10-30 10:33:02,396] Nothing to repair for keyspace 'system'
>> > [2014-10-30 10:33:02,420] Starting repair command #10, repairing 256
>> ranges for keyspace profiles (seq=false, full=false)
>> > [2014-10-30 10:33:17,240] Repair failed with error Did not get positive
>> replies from all endpoints.
>> > [2014-10-30 10:33:17,263] Starting repair command #11, repairing 256
>> ranges for keyspace OpsCenter (seq=false, full=false)
>> > [2014-10-30 10:33:32,242] Repair failed with error Did not get positive
>> replies from all endpoints.
>> > [2014-10-30 10:33:32,249] Starting repair command #12, repairing 256
>> ranges for keyspace system_traces (seq=false, full=false)
>> > [2014-10-30 10:33:44,243] Repair failed with error Did not get positive
>> replies from all endpoints.
>> >
>> > The local system log shows that the repair commands got started, but it
>> seems that they immediately get cancelled due to that error, which btw
>> can't be seen in the cassandra log.
>> >
>> > I tried monitoring all logs from all machines in case another machine
>> would show up with some useful error, but so far I haven't found anything.
>> >
>> > Any ideas where this error comes from?
>> >
>> > - Garo
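For what it's worth, the repair output above is regular enough to parse mechanically when watching several nodes at once. Here's a small helper of my own (nothing from the Cassandra tree; the line format is taken from the output quoted above) that pairs each "Starting repair command" line with the failure that follows it:

```python
import re

# Matches e.g. "Starting repair command #11, repairing 256 ranges for keyspace OpsCenter (...)"
START_RE = re.compile(r"Starting repair command #\d+, repairing \d+ ranges for keyspace (\S+)")
# Matches e.g. "Repair failed with error Did not get positive replies from all endpoints."
FAIL_RE = re.compile(r"Repair failed with error (.+)")

def failed_keyspaces(output):
    """Return a list of (keyspace, error) for repair commands that failed."""
    failures = []
    current = None
    for line in output.splitlines():
        m = START_RE.search(line)
        if m:
            current = m.group(1)
            continue
        m = FAIL_RE.search(line)
        if m and current:
            failures.append((current, m.group(1).rstrip(". ")))
            current = None
    return failures
```

Run over the output quoted above, it would report profiles, OpsCenter, and system_traces each failing with the "Did not get positive replies from all endpoints" error.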