Hello,

I am sorry it took us (the community) more than a day to respond to this
rather critical situation. That being said, my recommendation at this point
would be to make sure you understand the impact of whatever you try.
Working on a broken cluster in an emergency can lead to a second
mistake, possibly more destructive than the first one. It has happened to
me and to people around me, on many clusters. As a general piece of advice,
move forward even more carefully in these situations.

> Suddenly i lost all disks of cassandra-data on one of my racks


With RF=2 and a rack-aware setup, the surviving rack(s) should still hold at
least one replica of every piece of data. I guess operations use LOCAL_ONE
consistency, so you probably did not lose anything yet: the service is only
using the nodes that are up, and those nodes have the right data.

> tried to replace the nodes with same ip using this:
>
> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>

As a side note, I would recommend using 'replace_address_first_boot'
instead of 'replace_address'. It does basically the same thing, but it is
ignored after the first bootstrap. A detail, but it is there and somewhat
safer, so I would use this one.
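For what it is worth, here is roughly what it looks like in
cassandra-env.sh on the replacement node (10.0.0.5 is a made-up address,
use the dead node's own IP since you are reusing it):

    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"

With the '_first_boot' variant you can leave this line in place, it will be
ignored once the node has bootstrapped.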

> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces


By default, non-user keyspaces use 'SimpleStrategy' and a small RF. Ideally,
this should be changed in a production cluster, and you are seeing an
example of why.
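For reference, the kind of statement involved is the one below ('DC1' is a
placeholder, use your actual datacenter name, and consider doing the same
for 'system_auth' and 'system_distributed'):

    ALTER KEYSPACE system_traces
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2};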

> Now when i altered the system_traces keyspace strategy to
> NetworkTopologyStrategy and RF=2
> but then running nodetool repair failed: Endpoint not alive /IP of dead
> node that i'm trying to replace.
>

By changing the replication strategy, you made the dead rack the owner of
part of the token ranges, so repairs just cannot work: one of the replicas
involved will always be down as long as the whole rack is down. Repair will
not work, but you probably do not need it! 'system_traces' is a temporary /
debug keyspace. It is probably empty or holds irrelevant data.
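If you want to double check that, assuming the nodes that are up can answer
at the consistency level cqlsh uses (ONE by default), something like this
should tell you whether there is anything in there at all ('sessions' and
'events' are the two tables of that keyspace):

    SELECT count(*) FROM system_traces.sessions;
    SELECT count(*) FROM system_traces.events;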

Here are some thoughts:

* It would be great at this point for us (and for you, if you have not done
it yet) to see the status of the cluster:
** 'nodetool status'
** 'nodetool describecluster' --> This one will tell whether the nodes that
are up agree on the schema. I have seen schema changes made with nodes down
induce some issues.
** Cassandra version
** Number of racks (I assume #racks >= 2 in this email)

*Option 1: (Change schema and) use the replace method (preferred method)*
* Did you try to run the replacement without any prior repair, ignoring the
fact that 'system_traces' might be inconsistent? You probably do not care
about this keyspace, so if Cassandra allows it with some of the nodes down,
going this way is probably relatively safe. I really do not see what you
could lose that matters in there.
* Another option, if the first schema change was accepted, is to make a
second one, dropping the tables in this keyspace. You can always recreate
them later if you ever need them, I assume.


*Option 2: Remove all the dead nodes* (try to avoid this option; if
option 1 works, it is better).

Please do not take this and apply it as is. It is a thought on how you could
get rid of the issue, but it is rather brutal and risky; I have not
considered it deeply and I have no clue about your architecture and the
context. Consider it carefully on your side.

* You can run 'nodetool removenode' for each of the dead nodes. This will
trigger streaming between the remaining nodes, and the rack isolation
guarantee will no longer hold. It is hard to reason about what would happen
to the data and in terms of streaming.
* Alternatively, if you do not have enough space, you can even '*force*' the
'nodetool removenode'. See the documentation. Forcing it will prevent
streaming and remove the node (token ranges are handed over, but not the
data). If that does not work, you can use the 'nodetool assassinate' command
as well (see the example commands after this list).
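For reference, and hedged on your exact situation, these commands look
roughly like this (the host ID comes from 'nodetool status', 10.0.0.5 is a
made-up IP):

    nodetool removenode <host ID of the dead node>
    nodetool removenode force      # only if the removal hangs on streaming
    nodetool assassinate 10.0.0.5  # last resort, takes the node's IP address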

When adding nodes back to the broken rack, the first nodes will probably
take close to 100% of the ownership, which is often too much. You can
consider adding back all the nodes with 'auto_bootstrap: false' and
repairing them once they have their final token ownership, the same way we
do when building a new data center.

This option is not really clean and has some caveats that *you need to
consider before starting*, as there are token range movements and nodes
available that do not have the data. Yet this should work. I imagine it
would work nicely with RF=3 and QUORUM. With RF=2 (if you have 2+ racks), I
guess it should work as well, but you will have to pick one of availability
or consistency while repairing the data.

*Be aware that read requests hitting these empty nodes will not find the
data!* Plus, you are using *RF=2*. Thus a consistency level of 2+ (TWO,
QUORUM, ALL) for at least one of reads or writes is needed to preserve
consistency while re-adding the nodes in this case. Otherwise, reads will
not detect the mismatch with certainty and might return inconsistent data
until the nodes have been repaired.
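For example, from cqlsh (the keyspace, table and value below are only
placeholders, this merely illustrates setting the consistency level while
the nodes are being repaired):

    CONSISTENCY ALL;
    SELECT * FROM my_keyspace.my_table WHERE id = 42;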

I must say that I really prefer odd values for the RF, starting with RF=3.
With RF=2 you will have to pick: consistency or availability. With a
consistency level of ONE everywhere, the service is available, with no
single point of failure. Using anything bigger than that, for writes or
reads, brings consistency but creates single points of failure (actually
any node becomes a point of failure). RF=3 with QUORUM for both writes and
reads takes the best of the two worlds somehow. The tradeoff with RF=3 and
quorum reads is the latency increase and the resource usage.
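The arithmetic behind that preference, for what it is worth: QUORUM is
floor(RF/2) + 1. With RF=3 that is 2, so QUORUM writes plus QUORUM reads
touch 2 + 2 = 4 > 3 replicas, every read overlaps every write, and one
replica per range can be down without losing either consistency or
availability. With RF=2, QUORUM is also 2, which is the same as ALL, so a
single replica being down blocks quorum operations.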

Maybe there is a better approach, I am not too sure, but I think I would
try option 1 first in any case. It is less destructive and less risky: no
token range movements, no empty nodes serving requests. I am not sure about
the limitations you might face though, and that is why I suggest a second
option for you to consider if the first one is not actionable.

Let us know how it goes,
C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Mon, Sep 10, 2018 at 09:09, onmstester onmstester <onmstes...@zoho.com>
wrote:

> Any idea?
>
> ---- On Sun, 09 Sep 2018 11:23:17 +0430 onmstester onmstester
> <onmstes...@zoho.com> wrote ----
>
>
> Hi,
>
> Cluster Spec:
> 30 nodes
> RF = 2
> NetworkTopologyStrategy
> GossipingPropertyFileSnitch + rack aware
>
> Suddenly i lost all disks of cassandra-data on one of my racks, after
> replacing the disks, tried to replace the nodes with same ip using this:
>
> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>
> starting the to-be-replaced node fails with:
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
> the problem is that i did not change the default replication config for
> System keyspaces, but now when i altered the system_traces keyspace
> strategy to NetworkTopologyStrategy and RF=2
> but then running nodetool repair failed: Endpoint not alive /IP of dead
> node that i'm trying to replace.
>
> What should i do now?
> Can i just remove the previous nodes, change the dead nodes' IPs and
> re-join them to the cluster?
>
