Re: Repair failure under 0.8.6

2011-12-07 Thread Maxim Potekhin
I'm still having tons of problems with repairs and compactions, where the nodes are declared dead in their log files even though they were online at all times. This leads to problem behavior: once again I see that repair fails, and the cluster becomes unusable since there is no space to compact …
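
A quick way to confirm the disk-space squeeze described above, assuming a stock install (the data path below is an assumption; match it to data_file_directories in cassandra.yaml):

    # free space on the data volume (path is an assumption)
    df -h /var/lib/cassandra/data
    # per-node load as Cassandra reports it
    nodetool -h localhost ring
    # live vs. total on-disk usage per column family
    nodetool -h localhost cfstats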

Re: Repair failure under 0.8.6

2011-12-05 Thread Maxim Potekhin
Basically I tweaked the phi, put in more verbose GC reporting, and decided to do a compaction before I proceed. I'm getting this on the node where the compaction is being run, and the system log for the other two nodes follows. It's obvious that the cluster is sick, but I can't determine why -- there …
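
For reference, "more verbose GC reporting" is typically enabled by uncommenting or adding the standard JVM GC-logging options in conf/cassandra-env.sh; a minimal sketch, with the log path as an assumption:

    # conf/cassandra-env.sh -- GC logging options
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"   # path is an assumption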

Re: Repair failure under 0.8.6

2011-12-04 Thread Edward Capriolo
You can set the min compaction threshold to 2 and the max compaction threshold to 3. If you have enough disk space for a few minor compactions, this should free up some disk space. On Sun, Dec 4, 2011 at 7:17 PM, Peter Schuller wrote: > As a side effect of the failed repair (so it seems) the disk …
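
A sketch of how that change is usually applied at runtime with nodetool, assuming the 0.8-era syntax; the keyspace and column family names here are placeholders:

    # lower the thresholds so small minor compactions can run with limited headroom
    nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 2 3
    # verify the new values
    nodetool -h localhost getcompactionthreshold MyKeyspace MyCF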

Re: Repair failure under 0.8.6

2011-12-04 Thread Peter Schuller
> As a side effect of the failed repair (so it seems) the disk usage on the affected node prevents compaction from working. It still works on the remaining nodes (we have 3 total). Is there a way to scrub the extraneous data? This is one of the reasons why killing an in-process repair is a b…

Re: Repair failure under 0.8.6

2011-12-04 Thread Maxim Potekhin
As a side effect of the failed repair (so it seems) the disk usage on the affected node prevents compaction from working. It still works on the remaining nodes (we have 3 total). Is there a way to scrub the extraneous data? Thanks, Maxim. On 12/4/2011 4:29 PM, Peter Schuller wrote: I will try …

Re: Repair failure under 0.8.6

2011-12-04 Thread Peter Schuller
> I will try to increase phi_convict -- I will just need to restart the cluster after the edit, right? You will need to restart the nodes for which you want the phi convict threshold to be different. You might want to do it on e.g. half of the cluster for A/B testing. > I do recall that I see …
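
A sketch of the per-node restart this implies, assuming a package install with an init script (the service name and the use of drain are assumptions, not something prescribed in the thread):

    # on each node whose phi_convict_threshold was changed
    nodetool -h localhost drain        # flush memtables; node stops accepting writes
    sudo /etc/init.d/cassandra restart # init script name is an assumption
    nodetool -h localhost ring         # confirm the node rejoined the ring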

Re: Repair failure under 0.8.6

2011-12-04 Thread Maxim Potekhin
Please disregard the GC part of the question -- I found it. On 12/4/2011 4:12 PM, Maxim Potekhin wrote: Thanks Peter! I will try to increase phi_convict -- I will just need to restart the cluster after the edit, right? I do recall that I see nodes temporarily marked as down, only to pop up …

Re: Repair failure under 0.8.6

2011-12-04 Thread Maxim Potekhin
Thanks Peter! I will try to increase phi_convict -- I will just need to restart the cluster after the edit, right? I do recall that I see nodes temporarily marked as down, only to pop up later. In the current situation, there is no load on the cluster at all, outside the maintenance like …

Re: Repair failure under 0.8.6

2011-12-04 Thread Peter Schuller
> I capped heap and the error is still there. So I keep seeing "node dead" messages even when I know the nodes were OK. Where and how do I tweak timeouts? You can increase phi_convict_threshold in the configuration. However, I would rather want to find out why they are being marked as down to …
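
For reference, the setting lives in conf/cassandra.yaml; a minimal sketch of the kind of change being discussed (12 is just an illustrative bump over the shipped default of 8, not a recommendation from the thread):

    # conf/cassandra.yaml
    # higher values make the failure detector slower to declare a node dead
    phi_convict_threshold: 12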

Re: Repair failure under 0.8.6

2011-12-04 Thread Maxim Potekhin
I capped heap and the error is still there. So I keep seeing "node dead" messages even when I know the nodes were OK. Where and how do I tweak timeouts? …9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddr…

Re: Repair failure under 0.8.6

2011-12-03 Thread Maxim Potekhin
Thank you Peter. Before I look into the details as you suggest, may I ask what you mean by "automatically restarted"? The way the box and Cassandra are set up in my case is such that the death of either is final. Also, how do I look for full GC? I just realized that in the latest install, I might have …
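
One common way to look for long GC pauses is to grep Cassandra's own GCInspector output out of the system log (the log path is an assumption; adjust to the install):

    # GCInspector logs pauses it considers significant, e.g. "GC for ConcurrentMarkSweep: ... ms"
    grep "GC for" /var/log/cassandra/system.log
    # or follow the JVM GC log if -Xloggc was enabled in cassandra-env.sh
    tail -f /var/log/cassandra/gc.log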

Re: Repair failure under 0.8.6

2011-12-03 Thread Peter Schuller
Filed https://issues.apache.org/jira/browse/CASSANDRA-3569 to fix it so that streams don't die due to conviction. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: Repair failure under 0.8.6

2011-12-03 Thread Peter Schuller
> quite understand how Cassandra declared a node dead (in the below). Was it a timeout? How do I fix that? I was about to respond to say that repair doesn't fail just due to failure detection, but this appears to have been broken by CASSANDRA-2433 :( Unless there is a subtle bug, the exception y…

Repair failure under 0.8.6

2011-12-03 Thread Maxim Potekhin
Please help -- I've been having pretty consistent failures that look like this one. I don't know how to proceed. The text below comes from the system log. The cluster was all up before and after the attempted repair, so I don't quite understand how Cassandra declared a node dead (in the below). Was it a timeout? …