RE: node repair

Todd Burruss Mon, 22 Mar 2010 09:56:42 -0700

it's very possible if i thought it wasn't working.  is there a delay between 
compaction and streaming?  maybe i didn't see any activity and assumed it was 
finished.  i am fairly certain that if i had seen streaming action via JMX i 
would not have restarted.  and i know i didn't see any compaction.

________________________________________
From: Stu Hood [stu.h...@rackspace.com]
Sent: Monday, March 22, 2010 7:08 AM
To: user@cassandra.apache.org
Subject: RE: node repair

Hey Todd,

Repair involves 2 major compactions in addition to the streaming. More 
information is logged about the compactions and repair when you are using DEBUG.

Do you think you might have restarted the node being repaired during the 
streaming process? I'm not sure we have good handling for that case.

Thanks,
Stu

-----Original Message-----
From: "Todd Burruss" <bburr...@real.com>
Sent: Sunday, March 21, 2010 3:43pm
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: node repair

while preparing a test to capture logs i decided to not let the data set get 
too big and i did see it finish.  i only had 2gb this go around, but i had 
about 120+gb before so maybe it was just taking a long time.

that being said, i believe we need something in the logs, nodetool, and/or JMX 
that indicates a repair is happening and it is finished.  i see messages about 
streaming in JMX, but the current status for the crashed/repaired 
(192.168.132.105) host is "Sending a stream initiate message to 
/192.168.132.103".  but nodetool streams reports:

[bburr...@kv-app03 ~]$ ~/cassandra/bin/nodetool -host localhost -port 9000 streams
Mode: Normal
 Nothing streaming to /192.168.132.105
Not receiving any streams.
[bburr...@kv-app03 ~]$
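
[Editor's sketch] The idle report above could be checked from a script while waiting for a repair to finish. This is a minimal Python sketch that only recognises the three line shapes shown in the output above; what nodetool prints while a stream is active is not shown in this thread, so anything else is simply treated as "not idle". The function name streams_idle is hypothetical.

```python
def streams_idle(streams_output):
    """Return True if the text looks like the idle report quoted in the thread.

    Assumption: only the line shapes seen in the thread are known, so any
    other non-blank line is treated as a sign of possible streaming activity.
    """
    idle_prefixes = ("Mode:", "Nothing streaming", "Not receiving any streams")
    lines = [line.strip() for line in streams_output.splitlines() if line.strip()]
    # Empty output is not counted as idle; it more likely means the call failed.
    return bool(lines) and all(line.startswith(idle_prefixes) for line in lines)


SAMPLE = """Mode: Normal
 Nothing streaming to /192.168.132.105
Not receiving any streams.
"""

print(streams_idle(SAMPLE))
```

An idle report does not by itself prove that the repair completed; it only says nothing is streaming at the moment of the call, which is the ambiguity raised in this message.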

... so i guess it is working, but maybe just some confusion with messaging .. i 
can't explain why nodetool ring shows the sizes it does below except for read 
repair ... i'll keep an eye out for it again and try it again with more data.

thx
________________________________________
From: Stu Hood [stu.h...@rackspace.com]
Sent: Sunday, March 21, 2010 12:08 PM
To: user@cassandra.apache.org
Subject: RE: node repair

If you have debug logs from the run, would you mind opening a JIRA describing 
the problem?

-----Original Message-----
From: "Todd Burruss" <bburr...@real.com>
Sent: Sunday, March 21, 2010 1:30pm
To: "Todd Burruss" <bburr...@real.com>, "user@cassandra.apache.org" 
<user@cassandra.apache.org>
Subject: RE: node repair

one last comment about testing this: i stopped all the servers, wiped their 
data and restarted.  allowed each node to get about 15gb on them, then repeated 
the test.  the nodetool repair does not repair the crashed node.

the only minorly interesting thing about my cluster is that i use random 
partitioner and assigned a token to each node.
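
[Editor's sketch] For reference, evenly spaced tokens for the random partitioner can be computed as below. This is a sketch under two assumptions: that the token space runs from 0 to 2**127, and that counting starts at 1 so the last node lands on 2**127, which matches the last token in the ring output quoted further down the thread. The first three tokens in that output are each one less than the values this produces. The function name is hypothetical.

```python
RING_SIZE = 2 ** 127  # assumed upper bound of the random partitioner token space


def evenly_spaced_tokens(node_count):
    """Return node_count tokens spaced equally around the ring.

    Counting starts at 1 so the final token equals RING_SIZE, mirroring the
    layout seen in the thread; starting at 0 is an equally valid convention.
    """
    return [RING_SIZE * i // node_count for i in range(1, node_count + 1)]


for token in evenly_spaced_tokens(4):
    print(token)
```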

________________________________________
From: Todd Burruss
Sent: Saturday, March 20, 2010 6:48 PM
To: Todd Burruss; user@cassandra.apache.org
Subject: RE: node repair

fyi ... i just compacted and node 105 is definitely not being repaired
________________________________________
From: Todd Burruss
Sent: Saturday, March 20, 2010 12:34 PM
To: user@cassandra.apache.org
Subject: RE: node repair

same IP, same token.  i'm trying Handling Failure, #3.

it is running, a part of the ring, and seems to be handling reads/writes, but 
does not appear to have received a copy of its data (the last node below).  
i've searched all the logs for ERRORs but there are none.  i will compact the 
other nodes, but i don't think it will make a difference.

[bburr...@kv-app05 ~]$ ~/cassandra/bin/nodetool -h localhost -p 9000 ring
Address       Status     Load          Range                                      Ring
                                       170141183460469231731687303715884105728
192.168.132.102Up         130.22 GB     42535295865117307932921825928971026431     |<--|
192.168.132.103Up         131.03 GB     85070591730234615865843651857942052863     |   |
192.168.132.104Up         125.7 GB      127605887595351923798765477786913079295    |   |
192.168.132.105Up         65.62 GB      170141183460469231731687303715884105728    |-->|
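
[Editor's sketch] The load gap on 192.168.132.105 can be picked out of the ring output mechanically. The Python sketch below parses rows shaped like the ones above and flags nodes whose load is well below the cluster median. The 0.75 threshold, the unit list, and the function names are assumptions for illustration, not anything nodetool provides. The pattern tolerates a missing space between the address and the status column.

```python
import re
import statistics

# Address, status, load amount, load unit. Units other than GB are assumed.
ROW = re.compile(r"^(\d+\.\d+\.\d+\.\d+)\s*(Up|Down)\s+([\d.]+)\s*(KB|MB|GB|TB)\b")
GB_PER_UNIT = {"KB": 1 / 1024 ** 2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def parse_loads(ring_output):
    """Map each node address to its reported load in GB."""
    loads = {}
    for line in ring_output.splitlines():
        match = ROW.match(line.strip())
        if match:
            address, _status, amount, unit = match.groups()
            loads[address] = float(amount) * GB_PER_UNIT[unit]
    return loads


def underloaded(loads, ratio=0.75):
    """Return addresses whose load is below ratio * median load."""
    median = statistics.median(loads.values())
    return sorted(addr for addr, load in loads.items() if load < ratio * median)


SAMPLE = """
192.168.132.102Up         130.22 GB     42535295865117307932921825928971026431
192.168.132.103Up         131.03 GB     85070591730234615865843651857942052863
192.168.132.104Up         125.7 GB      127605887595351923798765477786913079295
192.168.132.105Up         65.62 GB      170141183460469231731687303715884105728
"""

print(underloaded(parse_loads(SAMPLE)))  # prints ['192.168.132.105']
```

A low load is only a hint: with evenly spaced tokens the nodes should hold similar amounts, but compaction state and read repair can also move these numbers.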

________________________________________
From: Jonathan Ellis [jbel...@gmail.com]
Sent: Saturday, March 20, 2010 11:23 AM
To: user@cassandra.apache.org
Subject: Re: node repair

if you bring up a new node w/ a different ip but the same token, it
will confuse things.

http://wiki.apache.org/cassandra/Operations "handling failure" section
covers best practices here.

On Sat, Mar 20, 2010 at 11:51 AM, Todd Burruss <bburr...@real.com> wrote:
> i had a node fail, lost all data.  so i brought it back up fresh, but 
> assigned it the same token in storage-conf.xml.  then ran nodetool repair.
>
> all compactions have finished, no streams are happening.  nothing.  so i did 
> it again.  same thing.  i don't think it's working.  is there a log message i 
> can search for?  INFO is my log level.  i could try it again with debug i 
> suppose.
>
> thx
