Odd Node Behavior

E S Mon, 14 May 2012 06:01:20 -0700

Hello,

I am having some very strange issues with a Cassandra setup.  I recognize that 
this is not the ideal cluster setup, but I'd still like to try and understand 
what is going wrong.


The cluster has 3 machines (A,B,C) running Cassandra 1.0.9 with JNA.  A & B are 
in datacenter1 while C is in datacenter2.  Cassandra knows about the different 
datacenter because of the rack inferred snitch.  However, we are currently 
using a simple placement strategy on the keyspace.  All reads and writes are 
done with quorum.  Hinted handoffs are enabled.  Most of the Cassandra 
settings are at their defaults, with the exception of thrift message sizes, 
which we have upped to 256 MB (while very rare, we can sometimes have a few 
larger rows so wanted a big buffer).  There is a firewall between the two 
datacenters.  We have enabled TCP traffic for the thrift and storage ports (but 
not JMX, and no UDP).
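
To spell out the quorum arithmetic behind this setup, here is a minimal sketch 
in Python.  It assumes a replication factor of 3 (one replica on each of A, B 
and C), which is an assumption for illustration only; the function names are 
hypothetical and not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def quorum_possible(replication_factor: int, replicas_responding: int) -> bool:
    """True if enough replicas are responding to satisfy QUORUM."""
    return replicas_responding >= quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # assumption: one replica on each of A, B and C
    print(quorum(rf))              # replicas needed for QUORUM
    print(quorum_possible(rf, 2))  # A down, B and C both responding
    print(quorum_possible(rf, 1))  # A down, only B responding
```

Under that assumption, a quorum operation needs two replicas, so with one node 
down both remaining nodes have to respond for requests to succeed.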

Another odd thing is that there are actually 2 Cassandra clusters hosted on 
these machines (although with the same setup).  Each machine has 2 Cassandra 
processes, but the two clusters use different ports and different cluster 
names.

On one of the two clusters we were doing some failover testing.  We would take 
nodes down in quick succession and make sure the system remained up.

Most of the time, we got a few timeouts on the failover (unexpected, but not 
the end of the world) and then quickly recovered; however, twice we were able 
to put the cluster in an unusable state.  We found that sometimes node C, while 
seemingly up (no load, and marked as UP in the ring by other nodes), was 
unresponsive to B (when A was down) when B was coordinating a quorum write.  We 
see B making a request in the logs (on debug) and 10 seconds later timing out.  
We see nothing happening in C's log (also debug).  The box is just idling.  In 
retrospect, I should have put it in trace (will do this next time).  We had it 
come back after 30 minutes once.  Another time, it came back earlier after 
cycling it.
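
One check that could help narrow this down is whether C's storage port accepts 
TCP connections from B while the problem is happening.  A minimal sketch in 
Python follows; the helper name is hypothetical, and the host and port in the 
usage comment are placeholders for the real values.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage, run from B (placeholder host and port):
#   tcp_port_open("node-c.example", 7000)
```

A successful connection only shows that the TCP handshake gets through the 
firewall.  It does not prove that the Cassandra process on C is reading and 
handling messages on that connection.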

I also noticed a few other crazy log messages on C in that time period.  There 
were two instances of "invalid protocol header", which in code seems to only 
happen when PROTOCOL_MAGIC doesn't match (MessagingService.java), which seems 
like an impossible state.
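
For reference, here is my reading of that check, sketched in Python rather 
than the original Java.  The constant value is what I recall from the source 
and should be treated as an assumption, and the function name is hypothetical.  
The idea is that the first four bytes of an incoming connection on the storage 
port are read as a big-endian integer and compared against the magic number.

```python
import struct

# Assumption: value of PROTOCOL_MAGIC as recalled from MessagingService.java.
PROTOCOL_MAGIC = 0xCA552DFA


def has_valid_magic(first_bytes: bytes) -> bool:
    """Return True if the first four bytes match the expected magic number."""
    if len(first_bytes) < 4:
        return False
    # ">I" = big-endian unsigned 32-bit integer, matching Java's network order.
    (magic,) = struct.unpack(">I", first_bytes[:4])
    return magic == PROTOCOL_MAGIC
```

If that reading is right, any connection to the storage port whose first four 
bytes differ from the magic number would produce this message.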

I'm currently at a loss trying to explain what is going on.  Has anyone seen 
anything like this?  I'd appreciate any additional debugging ideas!  Thanks for 
any help.

Regards,
Eddie
