> I capped heap and the error is still there. So I keep seeing "node dead"
> messages even when I know the nodes were OK. Where and how do I tweak
> timeouts?

You can increase phi_convict_threshold in the configuration. However, I would rather find out why the nodes are being marked as down to begin with. In a healthy situation, especially if you are not putting extreme load on the cluster, there is very little reason for hosts to be marked as down unless there's a bug somewhere.

Is this cluster under constant traffic? Are you seeing slow requests from the point of view of the client (indicating that some requests are routed to nodes that are temporarily inaccessible)?

With respect to GC, I would recommend running with -XX:+PrintGC, -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps and -XX:+PrintGCDateStamps, and then looking at the system log. A fallback to full GC should be findable by grepping for "Full".

Also, is this a problem with one specific host, or is it happening to all hosts every now and then? And by that I mean either the host being flagged as down, or the host that is flagging others as down.

As for uncapped heap: generally a larger heap is not going to make a fallback to full GC more likely; usually the opposite is true. However, a larger heap can make some of the non-full GC pauses longer, depending on the collector and the workload. In either case, running with the above GC options will give you specific information on GC pauses and should allow you to rule GC in or out as the cause.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
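
P.S. If grepping by hand gets tedious, here is a rough sketch of the same check in Python: it picks out log lines that mention a full GC and, where the line carries a "real=... secs" field, pulls out the wall-clock pause. The function name and the two sample lines are made up for illustration; they only approximate what a pre-Java 9 HotSpot JVM prints with the flags above, and the exact format varies with JVM version and collector, so treat the pattern as an assumption to adjust against your own log.

```python
import re

# Wall-clock pause as printed in the "[Times: ... real=N secs]" suffix.
# Assumption: your JVM emits this suffix; not every version/collector does.
REAL_PAUSE = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def find_full_gcs(lines):
    """Return (line, pause_seconds_or_None) for each line mentioning a full GC."""
    hits = []
    for line in lines:
        if "Full GC" not in line:
            continue
        match = REAL_PAUSE.search(line)
        pause = float(match.group(1)) if match else None
        hits.append((line.rstrip("\n"), pause))
    return hits


# Illustrative sample lines (hypothetical values, not from a real cluster).
SAMPLE_LOG = [
    "2011-05-03T10:15:30.123+0000: 12.345: [GC [ParNew: 1000K->100K(2000K), "
    "0.0100000 secs] 3000K->2100K(8000K), 0.0101000 secs] "
    "[Times: user=0.02 sys=0.00, real=0.01 secs]",
    "2011-05-03T10:16:02.456+0000: 44.678: [Full GC [CMS: 6000K->3000K(6000K), "
    "2.5000000 secs] 7900K->3000K(8000K), [CMS Perm : 100K->100K(200K)], "
    "2.5001000 secs] [Times: user=2.40 sys=0.05, real=2.50 secs]",
]

if __name__ == "__main__":
    for line, pause in find_full_gcs(SAMPLE_LOG):
        print(pause, line)
```

To run it against a real log, pass an open file object instead of SAMPLE_LOG, since the function only iterates over lines. Full-GC pauses that line up in time with the "node dead" messages would point at GC; no matches at all would suggest looking elsewhere.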