I relaunched my cluster from scratch (for an unrelated reason). After the
relaunch I could run nodetool repair -par -inc -pr on the nodes without
issue, but pretty much the moment I started pushing production load to the
cluster I ran into the same problem again. I've opened a ticket asking for
additional logging, but I'll most probably end up adding the logging myself
and start digging into the actual root cause.
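For what it's worth, here's a minimal sketch of the kind of logging I have in mind. This is plain Java, not the actual Cassandra code: the class name, the String endpoints, and the timeout are all made up for illustration. The idea is simply to track which endpoints have not yet sent a positive reply, so that on timeout the error can name them instead of just saying "Did not get positive replies from all endpoints":

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: remember the endpoints we sent a prepare message to,
// remove each one as it acks, and report whatever is left after a timeout.
public class PrepareTracker
{
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final CountDownLatch latch;

    public PrepareTracker(Set<String> endpoints)
    {
        pending.addAll(endpoints);
        latch = new CountDownLatch(endpoints.size());
    }

    // Called from the reply callback when an endpoint responds positively.
    public void ack(String endpoint)
    {
        if (pending.remove(endpoint))
            latch.countDown();
    }

    // Wait for all replies; whatever is still in 'pending' never answered.
    public Set<String> awaitReplies(long timeout, TimeUnit unit) throws InterruptedException
    {
        latch.await(timeout, unit);
        return pending;
    }

    public static void main(String[] args) throws InterruptedException
    {
        PrepareTracker tracker = new PrepareTracker(Set.of("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        tracker.ack("10.0.0.1");
        tracker.ack("10.0.0.3");
        Set<String> missing = tracker.awaitReplies(100, TimeUnit.MILLISECONDS);
        System.out.println("Did not get positive replies from: " + missing);
    }
}
```

With something like this in place the failure message would at least tell us which node(s) to look at.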

I also ran one nodetool repair -par (i.e. without incremental repair) and it
seems that the repair started. I guess I need to go over the sources to see
if there's a different code path that would explain this.

I can't yet call this conclusive, but it seems that I can't run incremental
repairs on 2.1.1, and I'm still wondering if anybody else is experiencing
the same problem.

On Thu, Oct 30, 2014 at 1:14 PM, Juho Mäkinen <juho.maki...@gmail.com>
wrote:

> No, the cluster seems to be performing just fine. It seems that the
> prepareForRepair callback() could be easily modified to print which node(s)
> are unable to respond, so that the debugging effort could be focused
> better. This of course doesn't help this case as it's not trivial to add
> the log lines and to roll it out to the entire cluster.
>
> The cluster is relatively young, containing only 450GB with RF=3 spread
> over nine nodes. I was still practicing how to run incremental repairs on
> the cluster when I stumbled on this issue.
>
> On Thu, Oct 30, 2014 at 12:52 PM, Rahul Neelakantan <ra...@rahul.be>
> wrote:
>
>> It appears to come from the ActiveRepairService.prepareForRepair portion
>> of the Code.
>>
>> Are you sure all nodes are reachable from the node you are initiating
>> repair on, at the same time?
>>
>> Any Node up/down/died messages?
>>
>> Rahul Neelakantan
>>
>> > On Oct 30, 2014, at 6:37 AM, Juho Mäkinen <juho.maki...@gmail.com>
>> wrote:
>> >
>> > I'm having problems running nodetool repair -inc -par -pr on my 2.1.1
>> cluster due to "Did not get positive replies from all endpoints" error.
>> >
>> > Here's an example output:
>> > root@db08-3:~# nodetool repair -par -inc -pr
>> > [2014-10-30 10:33:02,396] Nothing to repair for keyspace 'system'
>> > [2014-10-30 10:33:02,420] Starting repair command #10, repairing 256
>> ranges for keyspace profiles (seq=false, full=false)
>> > [2014-10-30 10:33:17,240] Repair failed with error Did not get positive
>> replies from all endpoints.
>> > [2014-10-30 10:33:17,263] Starting repair command #11, repairing 256
>> ranges for keyspace OpsCenter (seq=false, full=false)
>> > [2014-10-30 10:33:32,242] Repair failed with error Did not get positive
>> replies from all endpoints.
>> > [2014-10-30 10:33:32,249] Starting repair command #12, repairing 256
>> ranges for keyspace system_traces (seq=false, full=false)
>> > [2014-10-30 10:33:44,243] Repair failed with error Did not get positive
>> replies from all endpoints.
>> >
>> > The local system log shows that the repair commands got started, but it
>> seems that they immediately get cancelled due to that error, which btw
>> can't be seen in the cassandra log.
>> >
>> > I tried monitoring all logs from all machines in case another machine
>> would show up with some useful error, but so far I haven't found anything.
>> >
>> > Any ideas where this error comes from?
>> >
>> >  - Garo
>> >
>>
>
>
