Re: Recovering from a faulty cassandra node

Marco Matarazzo Tue, 19 Mar 2013 09:04:39 -0700

I'm still missing something, please excuse me.

Let's say, for example, that I have a 4 node cluster with a replication factor of 
2. One node goes down and I have to reinstall it. In the meantime the cluster 
still works and data is read and written.
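
To make the scenario concrete, here is a minimal Python sketch of it. It assumes a simplified ring with one token range per node and replicas placed on the next nodes clockwise (roughly how SimpleStrategy behaves; vnodes and rack-aware placement are ignored). The node names n1..n4 and the function names are made up for illustration. The quorum size is the usual floor(RF / 2) + 1.

```python
# Simplified model: 4 nodes, one token range per node, RF = 2.
# Node names are hypothetical; with vnodes each node owns many ranges.

def replicas_for(owner_index, nodes, rf):
    """Replicas for the range owned by nodes[owner_index]: the owner
    plus the next rf - 1 nodes clockwise on the ring."""
    return [nodes[(owner_index + i) % len(nodes)] for i in range(rf)]

def quorum(rf):
    """Number of replicas that must answer at consistency level QUORUM."""
    return rf // 2 + 1

def availability(nodes, rf, down):
    """For each range, report whether ONE and QUORUM can still be met."""
    report = {}
    for i, owner in enumerate(nodes):
        live = [n for n in replicas_for(i, nodes, rf) if n not in down]
        report[owner] = {
            "live_replicas": live,
            "ONE": len(live) >= 1,
            "QUORUM": len(live) >= quorum(rf),
        }
    return report

nodes = ["n1", "n2", "n3", "n4"]
report = availability(nodes, rf=2, down={"n2"})
for owner, status in report.items():
    print(owner, status)
```

In this model every range keeps at least one live replica, so reads and writes at consistency level ONE keep working, while QUORUM (2 of 2 with RF = 2) cannot be met for the two ranges that had a replica on the failed node.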


After a while the node is reinstalled, the same IP is used, and the cassandra 
configuration is restored (but the data is not). Wouldn't it be enough to just 
start cassandra, and maybe run a repair?

On which node, and at which point of this scenario, should I use decommission 
and/or removenode?
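
As far as I understand the replies quoted below, the choice between the two commands comes down to whether the node being removed can still run nodetool itself, and whether other replicas hold its data. A small sketch of that rule follows. The helper name pick_removal_command is hypothetical, and it only encodes what is said in this thread; it is not an official procedure.

```python
def pick_removal_command(node_reachable, replication_factor):
    """Return the nodetool subcommand suggested in this thread, or None.

    - decommission is run on the leaving node itself, so that node must
      be up; it hands its data over before leaving, so RF = 1 is enough.
    - removenode is run from another node and relies on the remaining
      replicas, so it needs RF > 1 but works for an unreachable node.
    """
    if node_reachable:
        return "decommission"
    if replication_factor > 1:
        return "removenode"
    # Unreachable node and no other replica: nothing left to stream from.
    return None

print(pick_removal_command(node_reachable=False, replication_factor=2))
```

For the scenario above (node down, replication factor of 2) this points at removenode, run from one of the surviving nodes.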

On 19 Mar 2013, at 16:56, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Decommission doesn't need a RF > 1 since it is run from the node being 
> removed from the cluster. Before leaving, it hands its data to the next node 
> in the ring, which then becomes responsible for it.
> Removenode (at least if it is like the old removetoken) uses the remaining 
> replicas to send the data to its new nodes. So yes, this one needs a RF > 1, 
> but has the advantage that it can be used when a node is totally unreachable.
> 
> But anyway having a RF = 1 is pretty bad since you have a SPOF (Single Point 
> Of Failure) which can be avoided by C* with a higher RF.
> 
> Alain
> 
> 
> 2013/3/19 Marco Matarazzo <marco.matara...@hexkeep.com>
> Is nodetool removenode / decommission actually needed when having a RF > 1? 
> What does it do, exactly?
> 
> On 19 Mar 2013, at 16:45, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
> 
> > In 1.2, you may want to use nodetool removenode if your server is broken 
> > or unreachable; otherwise I guess nodetool decommission remains the right 
> > way to remove a node. (http://www.datastax.com/docs/1.2/references/nodetool)
> >
> > Once this node is out of the ring, rm -rf /yourpath/cassandra/* on this 
> > server, change the configuration if needed (not sure about the 
> > auto_bootstrap param) and start Cassandra on that node again. It should 
> > join the ring as a new node.
> >
> > Good luck.
> >
> >
> > 2013/3/19 Hiller, Dean <dean.hil...@nrel.gov>
> > Since you "cleared" out that node, it IS the replacement node.
> >
> > Dean
> >
> > From: Jabbar Azam <aja...@gmail.com>
> > Reply-To: user@cassandra.apache.org
> > Date: Tuesday, March 19, 2013 9:29 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Recovering from a faulty cassandra node
> >
> > Hello Dean.
> >
> > I'm using vnodes so can't specify a token. In addition I can't follow the 
> > replace node docs because I don't have a replacement node.
> >
> >
> > On 19 March 2013 15:25, Hiller, Dean <dean.hil...@nrel.gov> wrote:
> > I have not done this as of yet, but from all that I have read your best 
> > option is to follow the replace node documentation, which I believe says 
> > you need to:
> >
> >
> >  1.  Have the token be the same BUT add 1 to it so it doesn't think it's 
> > the same computer
> >  2.  Have the bootstrap option set or something so streaming takes effect.
> >
> > I would however test all of that in QA to make sure it works. If you use 
> > QUORUM reads/writes, a good part of that test would be to take node X down 
> > after your node Y is back in the cluster, to make sure reads/writes are 
> > working on the node you fixed. You just need to make sure node X shares 
> > one of the token ranges of node Y AND your writes/reads are in that token 
> > range.
> >
> > Dean
> >
> > From: Jabbar Azam <aja...@gmail.com>
> > Reply-To: user@cassandra.apache.org
> > Date: Tuesday, March 19, 2013 8:51 AM
> > To: user@cassandra.apache.org
> > Subject: Recovering from a faulty cassandra node
> >
> > Hello,
> >
> > I am using Cassandra 1.2.2 on a 4 node test cluster with vnodes. I spent 
> > over a week inserting lots of data into the cluster. Towards the end of 
> > the process one of the nodes had a hardware fault.
> >
> > I have fixed the hardware fault but the file system on that node is 
> > corrupt so I'll have to reinstall the OS and cassandra.
> >
> > I can think of two ways of reintegrating the host into the cluster
> >
> > 1) shrink the cluster to three nodes and add the node into the cluster
> >
> > 2) Add the node into the cluster without shrinking
> >
> > I'm not sure of the best approach to take and I'm not sure how to achieve 
> > each step.
> >
> > Can anybody help?
> >
> >
> > --
> > Thanks
> >
> >  Jabbar Azam
> >
> >
> >
> > --
> > Thanks
> >
> > Jabbar Azam
> >
> 
> --
> Marco Matarazzo
> == Hex Keep ==
> 
> W: http://www.hexkeep.com
> M: +39 347 8798528
> E: marco.matara...@hexkeep.com
> 
> "You can learn more about a man
>   in one hour of play
>   than in one year of conversation." - Plato
> 

--
Marco Matarazzo
== Hex Keep ==

W: http://www.hexkeep.com
M: +39 347 8798528
E: marco.matara...@hexkeep.com

"You can learn more about a man
  in one hour of play
  than in one year of conversation." - Plato
