Michael, thanks for the input. I don't think I'm going to need to upgrade to 3.11 for the sake of getting nodetool repair working for me. Instead, I have another plausible explanation and solution for my particular situation.
First, I should say that disk usage proved to be a red herring. There was plenty of disk space available. When I said that the error message I was seeing was no more precise than "Some repair failed," I misstated things. Just above that error message was a further detail: "Validation failed in /(IP address of host)." Of course, that's still vague. What validation failed? However, that extra information led me to this JIRA ticket: https://issues.apache.org/jira/browse/CASSANDRA-10057. In particular this comment: "If you invoke repair on multiple node at once, this can be happen. Can you confirm? And once it happens, the error will continue unless you restart the node since some resources remain due to the hang. I will post the patch not to hang."

Now, the particular symptom that comment refers to is not what I was seeing, but it got me thinking that perhaps my failures came from attempting to run "nodetool repair --partitioner-range" simultaneously on all the nodes in my cluster. These are only three-node dev clusters, and what I would see is that the repair would pass on one node but fail on the other two. So I tried running the repairs sequentially, one node at a time. With this change the repair works, and I have every expectation that it will continue to work; running the repairs sequentially appears to be the solution to my particular problem. If that's the case and repairs are now intended to be run sequentially, then that constitutes a contract change for nodetool repair. This is the first time I've run a repair on a multi-node Cassandra 3.10 cluster, and only with 3.10 have I seen this problem; I never saw it running repairs on the Cassandra 2.1 clusters I was upgrading from. The last comment on that JIRA ticket comes from someone reporting the same problem I'm seeing, and their experience indirectly corroborates mine, or at least doesn't contradict it. (I've pasted a rough sketch of the sequential loop below the quoted message.)

On Thu, Jul 27, 2017 at 10:26 AM, Michael Shuler <mich...@pbandjelly.org> wrote:

> On 07/27/2017 12:10 PM, Mitch Gitman wrote:
> > I'm using Apache Cassandra 3.10.
> <snip>
> > this is a dev cluster I'm talking about.
> <more snippage>
> > Further insights welcome...
>
> Upgrade and see if one of the many fixes for 3.11.0 helped?
>
> https://github.com/apache/cassandra/blob/cassandra-3.11.0/CHANGES.txt#L1-L129
>
> If you can reproduce on 3.11.0, hit JIRA with the steps to repro. There
> are several bug fixes committed to the cassandra-3.11 branch, pending a
> 3.11.1 release, but I don't see one that's particularly relevant to your
> trace.
>
> https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt
>
> --
> Kind regards,
> Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
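P.S. For anyone who runs into the same thing, here's roughly what the sequential approach looks like as a shell loop. This is just a sketch: the host names are placeholders for my dev cluster, and it assumes passwordless SSH to each node.

  #!/usr/bin/env bash
  # Run a partitioner-range repair on one node at a time instead of kicking
  # off repairs on all nodes in parallel. Host names below are placeholders.
  set -e
  for host in cass-dev-1 cass-dev-2 cass-dev-3; do
    echo "Repairing ${host}..."
    # nodetool repair runs in the foreground and only returns once the repair
    # session finishes (or fails), so this loop serializes the repairs.
    ssh "${host}" nodetool repair --partitioner-range
  done

Assuming nodetool exits non-zero when a repair fails, the "set -e" should stop the loop at the first failure rather than piling further repairs on top of it.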