I have observed one problem with an inconsistent ring that is
superficially similar (the node thinks it's up but its peers disagree)
and noted the details in CASSANDRA-6082. However, it does not sound like
either the symptoms or the resolution match what you describe.
If you have not already, running `nodetool gossipinfo` might give you
more clues than `status`.
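
When comparing what each node reports, it can help to write the disagreement down explicitly. The following is only a rough sketch, not a tool that exists anywhere: it assumes you have transcribed by hand, for each node, the Up/Down state that node reports for every member of the ring. The `views` mapping, the IP addresses, and the `find_isolated_nodes` helper are all hypothetical names made up for illustration.

```python
from typing import Dict, List

# Hypothetical input: for each node, the Up/Down state it reports for every
# node in the ring (itself included), transcribed by hand from each host.
RingViews = Dict[str, Dict[str, str]]


def find_isolated_nodes(views: RingViews) -> List[str]:
    """Return nodes that report themselves Up while all reporting peers say Down."""
    isolated = []
    for node, own_view in views.items():
        # Only consider nodes that believe they are Up.
        if own_view.get(node) != "Up":
            continue
        # Collect what every other node says about this node.
        peer_opinions = [
            view[node]
            for peer, view in views.items()
            if peer != node and node in view
        ]
        # Flag the node only if at least one peer reports and all say Down.
        if peer_opinions and all(state == "Down" for state in peer_opinions):
            isolated.append(node)
    return sorted(isolated)


# Made-up example data matching the symptom described in this thread.
views = {
    "10.0.0.1": {"10.0.0.1": "Up", "10.0.0.2": "Up", "10.0.0.3": "Down"},
    "10.0.0.2": {"10.0.0.1": "Up", "10.0.0.2": "Up", "10.0.0.3": "Down"},
    "10.0.0.3": {"10.0.0.1": "Down", "10.0.0.2": "Down", "10.0.0.3": "Up"},
}
print(find_isolated_nodes(views))  # ['10.0.0.3']
```

A node flagged this way is the one whose view has split from the rest of the ring, so its gossip state is the first one worth comparing against a healthy peer's.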
On 09/13/2013 10:48 AM, Dave Cowen wrote:
Hi, all -
We've been running Cassandra 1.1.12 in production since February, and have
experienced a vexing problem with an arbitrary node "falling out of" or
separating from the ring on occasion.
When a node "falls out" of the ring, running nodetool ring on the
misbehaving node shows that the misbehaving node believes that it is Up,
but that the rest of the ring is Down, and the rest of the ring has
question marks listed for load. nodetool ring on any of the other nodes,
however, shows the misbehaving node as Down but everything else as Up.
Shutting down and restarting the misbehaving node does not result in
changed behavior. We can only get the misbehaving node to rejoin the ring
by shutting it down, running nodetool removetoken <misbehaving node token>
and nodetool removetoken force elsewhere in the ring. After the node's
token has been removed from the ring, it will rejoin and behave normally
when it is restarted.
This is not a frequent occurrence - we can go months between
incidents. It most commonly occurs when a different node is brought down
and then back up, but it can happen spontaneously. This is also not
associated with a network connectivity event; we've seen no interruption in
the nodes being able to communicate over the network. As above, it's also
not isolated to a single node; we've seen this behavior on multiple nodes.
This has occurred both with identical seeds specified in cassandra.yaml
on each node, and when we remove each node from its own seed list (so
that no seed tries to auto-bootstrap from itself). Seeds have always been
up and available.
Has anyone else seen similar behavior? For obvious reasons, we hate seeing
one of the nodes suddenly "fall out" and require intervention when we flap
another node, or for no reason at all.
Thanks,
Dave