Hello, I am having some very strange issues with a Cassandra setup. I recognize that this is not the ideal cluster setup, but I'd still like to try to understand what is going wrong.
The cluster has 3 machines (A, B, C) running Cassandra 1.0.9 with JNA. A and B are in datacenter1 while C is in datacenter2. Cassandra knows about the different datacenters because of the rack-inferred snitch. However, we are currently using a simple placement strategy on the keyspace. All reads and writes are done with quorum. Hinted handoffs are enabled. Most of the Cassandra settings are at their defaults, with the exception of the thrift message sizes, which we have upped to 256 MB (while very rare, we can sometimes have a few larger rows, so we wanted a big buffer).

There is a firewall between the two datacenters. We have enabled TCP traffic for the thrift and storage ports (but not JMX, and no UDP).

Another odd thing is that there are actually 2 Cassandra clusters hosted on these machines (although with the same setup). Each machine has 2 Cassandra processes, but everything is running on different ports and with different cluster names.

On one of the two clusters we were doing some failover testing. We would take nodes down in quick succession and make sure the system remained up. Most of the time, we got a few timeouts on the failover (unexpected, but not the end of the world) and then quickly recovered; however, twice we were able to put the cluster in an unusable state.

We found that sometimes node C, while seemingly up (no load, and marked as UP in the ring by the other nodes), was unresponsive to B (when A was down) when B was coordinating a quorum write. We see B making a request in the logs (on debug) and timing out 10 seconds later. We see nothing happening in C's log (also on debug). The box is just idling. In retrospect, I should have set it to trace (I will do this next time). Once, it came back on its own after 30 minutes. Another time, it came back earlier after we cycled it.

I also noticed a few other crazy log messages on C in that time period.
There were two instances of "invalid protocol header", which in the code seems to happen only when PROTOCOL_MAGIC doesn't match (MessagingService.java), which seems like an impossible state.

I'm currently at a loss trying to explain what is going on. Has anyone seen anything like this? I'd appreciate any additional debugging ideas!

Thanks for any help.

Regards,
Eddie
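
P.S. To make the "invalid protocol header" part concrete, here is a minimal sketch of the check as I understand it from reading MessagingService.java. It is my paraphrase, not the actual source. The constant value 0xCA552DFA is from memory and should be verified against the 1.0.9 code, and the class name MagicCheckSketch and the helper acceptsHeader are made up for illustration. It only shows that the first four bytes read from an incoming connection are interpreted as a big-endian int and compared to the magic, so any other four bytes arriving on the storage port would produce this message.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MagicCheckSketch {
    // Assumed value, recalled from MessagingService.java; verify against the 1.0.9 source.
    static final int PROTOCOL_MAGIC = 0xCA552DFA;

    // Paraphrase of the check: reject any header int that is not the magic.
    static void validateMagic(int magic) throws IOException {
        if (magic != PROTOCOL_MAGIC) {
            throw new IOException("invalid protocol header");
        }
    }

    // Hypothetical helper: treats the given bytes as the start of an incoming
    // connection and reports whether the magic check would pass.
    static boolean acceptsHeader(byte[] firstBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(firstBytes))) {
            // readInt() consumes four bytes, big-endian; fewer than four bytes
            // raises EOFException, which is also an IOException.
            validateMagic(in.readInt());
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] expected = {(byte) 0xCA, 0x55, 0x2D, (byte) 0xFA};
        byte[] unrelated = "GET ".getBytes(StandardCharsets.US_ASCII);

        System.out.println("magic bytes accepted: " + acceptsHeader(expected));
        System.out.println("unrelated bytes accepted: " + acceptsHeader(unrelated));
    }
}
```

If that reading is right, the message does not require corrupted state inside Cassandra itself, only that something other than the expected header reached the storage port on C.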