Hi,

To make a long story short, I'm trying to understand the logic behind gossip. The following is an excerpt from a log captured today.
2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 has restarted, now UP again
2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 state jump to normal
2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 has restarted, now UP again
2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 state jump to normal
2011-03-28T18:37:56.559275+00:00 Binding thrift service to /10.208.214.3:9160
2011-03-28T18:37:56.563852+00:00 Using TFastFramedTransport with a max frame size of 15728640 bytes.
2011-03-28T18:37:56.574986+00:00 Listening for thrift clients...
2011-03-28T18:37:56.805426+00:00 Node /10.202.61.193 has restarted, now UP again
2011-03-28T18:37:56.806171+00:00 Nodes /10.202.61.193 and /10.96.81.193 have the same token [XXX]. Ignoring /10.202.61.193

What's surprising is that 10.96.81.193 has been gone for at least a week (it was the subject of CASSANDRA-2371).

> 2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 has restarted, now UP again
> 2011-03-28T18:37:56.505316+00:00 Node /10.96.81.193 state jump to normal

The code I'm seeing is:

624     private void handleMajorStateChange(InetAddress ep, EndpointState epState)
625     {
626         if (endpointStateMap.get(ep) != null)
627             logger.info("Node {} has restarted, now UP again", ep);

endpointStateMap is not up-to-date here. Shouldn't a gossip exchange mark the node as down immediately? Or, if there's a network partition, couldn't the node at least be marked as being in an unknown state? I'm interpreting UP as "available" from Cassandra's point of view (see the sketch in [1] below).

> 2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 has restarted, now UP again
> 2011-03-28T18:37:56.505468+00:00 Node /10.124.143.36 state jump to normal

That's fine; the node is actually up.

> 2011-03-28T18:37:56.805426+00:00 Node /10.202.61.193 has restarted, now UP again

Fine too; 10.202.61.193 is up (for real).

> 2011-03-28T18:37:56.806171+00:00 Nodes /10.202.61.193 and /10.96.81.193 have the same token [XXX]. Ignoring /10.202.61.193

It's not clear to me on what basis that decision is made. From the code I see:

728         InetAddress currentOwner = tokenMetadata_.getEndpoint(token);
729         if (currentOwner == null)
730         {
            ...
735         }
736         else if (endpoint.equals(currentOwner))
737         {
            ...
741         }
742         else if (Gossiper.instance.compareEndpointStartup(endpoint, currentOwner) > 0)
743         {
            ...
749         }
750         else
751         {
752             logger_.info(String.format("Nodes %s and %s have the same token %s. Ignoring %s",
753                                        endpoint, currentOwner, token, endpoint));
754         }

So the decision falls through to the default (else) branch, which in this particular case does not do the right thing (my reading of the tie-break is sketched in [2] below). I've only looked at the source code cursorily, so I don't know how tokenMetadata gets updated.

I've also seen non-converging gossip rings today, where some ghost nodes show up and others don't. My recourse so far has been removetoken (force), since the initial removetoken runs left one node streaming to another while the recipient remained unaware of any such stream, as shown in the following excerpts.

On Node 1 (sender) -- stuck forever in that state:

Mode: Normal
Streaming to: /Node2
   /data/cassandra/data/Intake/Metrics-f-2271-Data.db/(0,601518756),(601518756,700910088)
      progress=127695240/700910088 - 18%

On Node 2 (recipient) at the very same moment:

Mode: Normal
 Not sending any streams.
 Not receiving any streams.

This kind of discrepancy is also hard to understand, given that nodetool ring on Node 1 yields:

...
Node2    Up    Normal    52.04 GB    51.03%    token1

and the same command on Node 2 yields:

...
Node1    Up    Normal    50.89 GB    23.97%    token2

Any light shed on both issues is appreciated.

--
Alexis Lê-Quôc (@datadoghq)
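
[1] For what it's worth, here is a rough, self-contained sketch of how I currently read the "restarted" path. The class and method names are my own simplifications (GossipRestartSketch and applyGossipedState in particular are hypothetical); the only things taken from the excerpt above are endpointStateMap and the log message itself. As I understand it, the gossip generation is the remote node's start time in seconds.

    import java.net.InetAddress;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class GossipRestartSketch
    {
        // Per-endpoint gossip state: generation is the remote node's start
        // time in seconds; version is a counter that increases within one
        // generation.
        static final class EndpointState
        {
            final int generation;
            final int version;
            EndpointState(int generation, int version)
            {
                this.generation = generation;
                this.version = version;
            }
        }

        // Normal ring members are, as far as I can tell, never purged from
        // this map, so a node that died a week ago still has an entry as long
        // as some peer remembers it.
        private final Map<InetAddress, EndpointState> endpointStateMap =
            new ConcurrentHashMap<InetAddress, EndpointState>();

        void applyGossipedState(InetAddress ep, EndpointState remote)
        {
            EndpointState local = endpointStateMap.get(ep);
            if (local != null && remote.generation > local.generation)
            {
                // We already had *some* state for ep (possibly a week old) and
                // the gossiped state carries a newer generation, i.e. the
                // process was started more recently than whatever we had
                // recorded. That seems to be all the message asserts; it is
                // not a statement that the failure detector currently sees ep
                // as alive.
                System.out.println("Node " + ep + " has restarted, now UP again");
            }
            endpointStateMap.put(ep, remote);
            // Actual liveness appears to be decided separately, by the
            // phi-accrual failure detector, from how recently heartbeats for
            // ep's current generation have arrived.
        }
    }

If that reading is right, "UP again" really means "newer generation observed", which would explain seeing it for 10.96.81.193 even though it has been dead for a week.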
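
[2] And a minimal sketch of how I read the tie-break in the second excerpt. compareEndpointStartup, as I understand it, compares the two endpoints' gossip generations (recorded start times in seconds); TokenTieBreakSketch, resolveTokenOwner and the int-based signature below are stand-ins of mine, not the real code.

    import java.net.InetAddress;

    public class TokenTieBreakSketch
    {
        // Stand-in for Gossiper.compareEndpointStartup: positive when the
        // first endpoint's recorded start time (gossip generation) is newer.
        static int compareEndpointStartup(int endpointGeneration, int ownerGeneration)
        {
            return endpointGeneration - ownerGeneration;
        }

        // Mirrors the branch structure of the excerpt above, with strings in
        // place of the elided bodies.
        static String resolveTokenOwner(InetAddress endpoint, InetAddress currentOwner,
                                        int endpointGeneration, int ownerGeneration)
        {
            if (currentOwner == null)
                return "endpoint takes the token (nobody owned it)";
            if (endpoint.equals(currentOwner))
                return "no-op: endpoint already owns the token";
            if (compareEndpointStartup(endpointGeneration, ownerGeneration) > 0)
                return "endpoint wins: it started more recently than the current owner";
            // The default branch from the excerpt: if the incoming endpoint is
            // not strictly "younger" than whatever tokenMetadata currently
            // records as the owner, the incoming endpoint is ignored --
            // presumably why the live 10.202.61.193 lost here: its recorded
            // start time is older than the ghost's last recorded start.
            return "ignored: current owner keeps the token";
        }
    }

If that is right, a live node that has been up longer than a ghost's last recorded start will always lose this tie-break, which matches what I'm seeing.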