Hey Marcin,

Are they actually going up and down repeatedly (flapping), or do they go
down and never come back?
There can be different reasons for flapping nodes, but off the top of my
head right now:

1. Network issues. I don't think this is your case, but you can read about
the issues some people have had when deploying C* on AWS EC2 (keyword to
look for: phi_convict_threshold; there's a small cassandra.yaml sketch
right after this list).

2. Heavy load. A node under heavy load because of a massive number of reads
/ writes / bulkloads, or e.g. unthrottled compaction, may end up with long
GC pauses.
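
If it does turn out to be the failure detector being too sensitive (the EC2
case above), the knob lives in cassandra.yaml. Just a rough illustration,
not a recommendation for your particular cluster:

    # cassandra.yaml -- default is 8; cloud/EC2 setups sometimes raise it
    # (e.g. towards 10-12) if the failure detector convicts nodes too eagerly
    phi_convict_threshold: 12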

Could any of these be a problem in your case? I'd start by investigating the
GC logs, e.g. to see how long the "stop the world" full GC pauses take (GC
logging should be on by default from what I can see [1]).
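
For example, something along these lines (assuming gc.log ends up where your
cassandra-env.sh points it, and that the usual GC logging flags are enabled;
adjust the path and patterns to whatever your setup actually uses):

    # long full collections
    grep "Full GC" /var/log/cassandra/gc.log | tail -20
    # total stop-the-world time, if -XX:+PrintGCApplicationStoppedTime is set
    grep "Total time for which application threads were stopped" /var/log/cassandra/gc.log | tail -20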

[1] https://issues.apache.org/jira/browse/CASSANDRA-5319

Michał


Kind regards,
Michał Michalski,
michal.michal...@boxever.com

On 2 April 2015 at 11:05, Marcin Pietraszek <mpietras...@opera.com> wrote:

> Hi!
>
> We have a 56 node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
> installed. Assume we have nodes A, B, C, D, E. On some irregular basis
> one of those nodes starts to report that a subset of the other nodes is
> in DN state, although the C* daemon on all nodes is running:
>
> A$ nodetool status
> UN B
> DN C
> DN D
> UN E
>
> B$ nodetool status
> UN A
> UN C
> UN D
> UN E
>
> C$ nodetool status
> DN A
> UN B
> UN D
> UN E
>
> After restarting node A, C and D report that A is in UN, and A also
> claims that the whole cluster is in UN state. Right now I don't have any
> clear steps to reproduce that situation; do you guys have any idea
> what could be causing such behaviour? How could this be prevented?
>
> It seems like when node A is the coordinator and gets a request for some
> data replicated on C and D, it responds with an Unavailable
> exception; after restarting A that problem disappears.
>
> --
> mp
>
