Re: Repair failure under 0.8.6

Peter Schuller Sun, 04 Dec 2011 11:10:25 -0800

> I capped heap and the error is still there. So I keep seeing "node dead"
> messages even when I know the nodes were OK. Where and how do I tweak
> timeouts?


You can increase phi_convict_threshold in the configuration. However,
I would rather find out why they are being marked as down to begin
with. In a healthy situation, especially if you are not putting
extreme load on the cluster, there is very little reason for hosts to
be marked as down unless there's some bug somewhere.
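
To give a feel for what raising phi_convict_threshold buys you, here is
a rough sketch of the idea behind a phi accrual failure detector. It is
a simplified model, not Cassandra's actual code: it assumes heartbeat
intervals are exponentially distributed, and the 1-second mean interval
and the thresholds 8 and 12 are only illustrative values. The function
names are made up for this example.

```python
import math


def phi(seconds_since_last_heartbeat, mean_interval_seconds):
    """Suspicion level under an assumed exponential model of heartbeats."""
    # P(no heartbeat for t seconds) = exp(-t / mean); phi is -log10 of that.
    return (seconds_since_last_heartbeat / mean_interval_seconds) / math.log(10)


def silence_before_conviction(threshold, mean_interval_seconds):
    """Seconds of silence after which phi reaches the given threshold."""
    return threshold * math.log(10) * mean_interval_seconds


if __name__ == "__main__":
    # Illustrative values only: 1 s mean heartbeat interval, two thresholds.
    for threshold in (8, 12):
        secs = silence_before_conviction(threshold, 1.0)
        print("threshold %d -> about %.1f s of silence" % (threshold, secs))
```

Under this model the tolerated silence grows linearly with the
threshold, so a higher value rides out longer pauses but also delays
detection of nodes that really are down.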

Is this cluster under constant traffic? Are you seeing slow requests
from the point of view of the client (indicating that some requests
are routed to nodes that are temporarily inaccessible)?

With respect to GC, I would recommend running with -XX:+PrintGC,
-XX:+PrintGCDetails, -XX:+PrintGCTimeStamps and
-XX:+PrintGCDateStamps, and then looking at the system log. A fallback
to full GC should be findable by grepping for "Full".
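
If plain grep gets tedious, something along these lines can pull out the
matching lines together with the pause time. This is only a sketch: the
helper name and the regular expression are my own, and it assumes the
HotSpot-style convention where a GC line ends its pause with
"<number> secs]". The log path is whatever you pass on the command line.

```python
import re
import sys

# Assumed format: pause durations appear as "<float> secs]" on the line.
PAUSE_RE = re.compile(r"(\d+\.\d+) secs\]")


def find_full_gcs(lines):
    """Return (line, pause_seconds) for every line that mentions "Full"."""
    hits = []
    for line in lines:
        if "Full" not in line:
            continue
        matches = PAUSE_RE.findall(line)
        # Take the last duration on the line; None if the format differs.
        pause = float(matches[-1]) if matches else None
        hits.append((line.rstrip("\n"), pause))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_full_gcs.py <path-to-log>
    with open(sys.argv[1]) as log:
        for text, pause in find_full_gcs(log):
            print(pause, text)
```

Comparing the timestamps of any long pauses with the times at which
nodes were flagged as down should show whether GC is involved.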

Also, is this a problem with one specific host, or is it happening to
all hosts every now and then? And I mean either the host being flagged
as down, or the host that is flagging others as down.

As for uncapped heap: Generally a larger heap is not going to make it
more likely to fall back to full GC; usually the opposite is true.
However, a larger heap can make some of the non-full GC pauses longer,
depending. In either case, running with the above GC options will
give you specific information on GC pauses and should allow you to
rule that out (or not).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
