I'm sorry, but I can't provide more detailed info, as I have restarted the
node. After the restart the number of pending tasks started at 40 and went
down rapidly as compactions finished. The ring now looks OK, with all the
nodes holding about the same amount of data. There were no errors in the
node log, only INFO messages.
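
In case it matters, this is roughly what I looked at to confirm that (the
host below is just a placeholder from my setup):

  # Pending tasks reported by the node's thread pools
  nodetool -h 10.0.0.1 tpstats

  # The Load column should be roughly even across the nodes
  nodetool -h 10.0.0.1 ring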

Should I run repair again just in case? The next time I have to recover
a node, is there a safer/faster way of doing it?

My guess about the number of pending tasks and sstables is that during
repair the node requested ranges of the same column family many times,
yielding a lot of tiny sstables. That caused minor compactions to pile up.
I have read on the mailing lists about never-ending repairs, and about
repairs that stream the same small amounts of data over and over. Could
this be what happened?


Thanks!


On Wed, 2011-05-04 at 21:02 +1200, aaron morton wrote:
> Certainly sounds a bit sick. 
> 
> The first error looks like it happens when the index file points to the wrong 
> place in the data file for the SSTable. The second one happens when the index 
> file is corrupted. These should be problems nodetool scrub can fix.
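> 
> If you do need it, something like this would scrub one keyspace on that node
> (the host and keyspace names are only placeholders):
> 
>   # Rewrites the sstables, skipping rows it cannot read
>   nodetool -h 10.0.0.1 scrub MyKeyspace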
> 
> The disk space may be dead space left behind by compaction or by some other 
> streaming failure. You can check how much space Cassandra considers to be live 
> (in use) using nodetool cfstats. This will also tell you how many sstables are 
> live. Having a lot of dead SSTables is not necessarily a bad thing. 
> 
> What are the pending tasks? What is nodetool tpstats showing? And what does 
> nodetool ring show from one of the other nodes? 
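> 
> For example (substitute your own hosts):
> 
>   # Thread pool pending/active counts on the recovered node
>   nodetool -h 10.0.0.1 tpstats
> 
>   # The ring as seen by a different, healthy node
>   nodetool -h 10.0.0.2 ring
> 
>   # Live space and live sstable counts per column family
>   nodetool -h 10.0.0.1 cfstats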
> 
> I'm assuming there are no errors in the logs on the node. What are the most 
> recent INFO messages?
> 
> Hope that helps. 
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 4 May 2011, at 17:54, Héctor Izquierdo Seliva wrote:
> 
> > 
> > Hi Aaron
> > 
> > It has no data files whatsoever. The upgrade path is 0.7.4 -> 0.7.5. It
> > turns out the initial problem was the software RAID failing silently
> > because of another faulty disk.
> > 
> > Now that the storage is working, I brought up the node again (same IP,
> > same token) and ran nodetool repair. 
> > 
> > All adjacent nodes have finished the streaming session, and now the node
> > has a total of 248 GB of data. Is this normal when the load per node is
> > about 18 GB? 
> > 
> > Also, there are 1245 pending tasks. It's been compacting or rebuilding
> > sstables for the last 8 hours non-stop. There are 2057 sstables in the
> > data folder.
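> > 
> > For what it's worth, I counted them like this (the data path and host are
> > from my setup, so treat them as placeholders):
> > 
> >   # Number of data files on disk for the keyspace
> >   ls /var/lib/cassandra/data/MyKeyspace/*-Data.db | wc -l
> > 
> >   # Pending task count
> >   nodetool -h 10.0.0.1 tpstats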
> > 
> > Should I have done things differently, or is this the normal behaviour?
> > 
> > Thanks!
> > 
> > On Wed, 2011-05-04 at 07:54 +1200, aaron morton wrote:
> >> When you say "it's clean", does that mean the node has no data files?
> >> 
> >> After you replaced the disk, what process did you use to recover?
> >> 
> >> Also, what version are you running, and what's the recent upgrade history?
> >> 
> >> Cheers
> >> Aaron
> >> 
> >> On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:
> >> 
> >>> Hi everyone. One of the nodes in my 6-node cluster died with disk
> >>> failures. I have replaced the disks, and it's clean. It has the same
> >>> configuration (same IP, same token).
> >>> 
> >>> When I try to restart the node, it starts throwing mmap underflow
> >>> exceptions until it shuts down again.
> >>> 
> >>> I tried setting I/O to standard, but it still fails. It gives errors
> >>> about two decorated keys being different, and an EOFException.
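> >>> 
> >>> By "I/O" I mean the disk access mode in cassandra.yaml (assuming that is
> >>> the right knob; the path below is from my install, so adjust it):
> >>> 
> >>>   # Should now print "disk_access_mode: standard"
> >>>   grep disk_access_mode /etc/cassandra/cassandra.yaml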
> >>> 
> >>> Here is an excerpt of the log
> >>> 
> >>> http://pastebin.com/ZXW1wY6T
> >>> 
> >>> I can provide more info if needed. I'm at a loss here, so any help is
> >>> appreciated.
> >>> 
> >>> Thanks all for your time
> >>> 
> >>> Héctor Izquierdo
> >>> 
> >> 
> > 
> > 
> 

