Re: Repair failure under 0.8.6

Maxim Potekhin Sun, 04 Dec 2011 13:16:41 -0800

Please disregard the GC part of the question -- I found it.


On 12/4/2011 4:12 PM, Maxim Potekhin wrote:

Thanks Peter!

I will try to increase phi_convict -- I will just need to restart the cluster after
the edit, right?

I do recall that I see nodes temporarily marked as down, only to pop up later.

In the current situation, there is no load on the cluster at all, outside the
maintenance like the repair.

How do I configure the print level for the GC report?

Thank you,
Maxim


On 12/4/2011 2:09 PM, Peter Schuller wrote:

I capped heap and the error is still there. So I keep seeing "node dead"
messages even when I know the nodes were OK. Where and how do I tweak
timeouts?

You can increase phi_convict_threshold in the configuration. However,
I would rather want to find out why they are being marked as down to
begin with. In a healthy situation, especially if you are not putting
extreme load on the cluster, there is very little reason for hosts to
be marked as down unless there's some bug somewhere.

Is this cluster under constant traffic? Are you seeing slow requests
from the point of view of the client (indicating that some requests
are routed to nodes that are temporarily inaccessible)?

With respect to GC, I would recommend running with -XX:+PrintGC and
-XX:+PrintGCDetails and -XX:+PrintGCTimeStamps and
-XX:+PrintGCDateStamps and then look at the system log. A fallback to
full GC should be findable by grepping for "Full".
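
For illustration only, the "grep for Full" step could be sketched in Python
roughly as below. The function name find_full_gc_lines is made up for this
example, and the log path is an assumption: pass whichever file the JVM's GC
output actually ends up in on your nodes. It does nothing smarter than a
substring match, so it is equivalent to grepping for "Full".

```python
import sys
from typing import Iterable, List


def find_full_gc_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines containing "Full", i.e. candidate full-GC events."""
    return [line.rstrip("\n") for line in lines if "Full" in line]


if __name__ == "__main__":
    # Hypothetical usage: python find_full_gc.py <path-to-gc-output>
    if len(sys.argv) > 1:
        # errors="replace" keeps the scan going if the log has odd bytes in it.
        with open(sys.argv[1], errors="replace") as log:
            for hit in find_full_gc_lines(log):
                print(hit)
```

If this prints nothing over a period in which nodes were flagged as down, a
fallback to full GC is an unlikely culprit for that period.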

Also, is this a problem with one specific host, or is it happening to
all hosts every now and then? And I mean either the host being flagged
as down, or the host that is flagging others as down.

As for uncapped heap: Generally a larger heap is not going to make it
more likely to fall back to full GC; usually the opposite is true.
However, a larger heap can make some of the non-full GC pauses longer,
depending. In either case, running with the above GC options will
give you specific information on GC pauses and should allow you to
rule that out (or not).
