Is there a reason you are using the trunk and not one of the tagged releases? Official releases are a lot more stable than the trunk.

Yes. We are using a combination of EC2 and colo servers, so we need broadcast_address from CASSANDRA-2491. The patch attached to that JIRA does not apply cleanly against 0.8, which is why we are using trunk.

1) thrift timeouts & general degraded response times
For reads or writes? What sort of queries are you running? Check the local latency on each node using nodetool cfstats and cfhistograms, plus a bit of iostat (see http://spyced.blogspot.com/2010/01/linux-performance-basics.html). What does nodetool tpstats say? Is there a stage backing up?

If the local latency is OK, look at the cross-DC situation. What CL are you using? Are nodes timing out waiting for nodes in other DCs?
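As a rough illustration of the "is there a stage backing up?" check, here is a minimal Python sketch that scans captured nodetool tpstats text for pools with a non-zero Pending count. The function name backed_up_stages, the sample text, and its numbers are made up for the example, and the column layout (Pool Name, Active, Pending, Completed) is an assumption that may differ between Cassandra versions.

```python
def backed_up_stages(tpstats_text, threshold=0):
    """Return {pool_name: pending} for pools whose Pending count exceeds threshold.

    Assumes a header line starting with "Pool Name" followed by one row per
    pool, with whitespace-separated integer columns (hypothetical layout).
    """
    header = None
    backed_up = {}
    for line in tpstats_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("Pool Name"):
            # Column names that follow the two-word "Pool Name" label.
            header = line.split()[2:]
            continue
        if line.startswith("Message type"):
            # Some versions append a dropped-messages section; stop there.
            break
        if header is None or "Pending" not in header:
            continue
        parts = line.split()
        values = parts[1:]
        if len(values) < len(header):
            continue
        try:
            pending = int(values[header.index("Pending")])
        except ValueError:
            # Skip rows that do not contain plain integer counters.
            continue
        if pending > threshold:
            backed_up[parts[0]] = pending
    return backed_up


# Invented sample text for illustration only; not output from a real cluster.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed
ReadStage                         0         0         123456
MutationStage                     2        41         654321
GossipStage                       0         0           9876
"""

if __name__ == "__main__":
    print(backed_up_stages(SAMPLE_TPSTATS))
```

An empty result on every node would suggest that no stage is queuing work locally, which points the investigation toward the cross-DC path instead.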

iostat doesn't show a request queue bottleneck. The timeouts we are seeing are for reads. The latency on the nodes I have temporarily used for reads is around 2-45ms. The next token in the ring at an alternate DC is showing ~4ms with everything else around 0.05ms. tpstats doesn't show any active/pending. Reads are at CL.ONE and writes at CL.ANY.


2) *lots* of exception errors, such as:
Repair is trying to run on a response that is a digest response; this should not be happening. Can you provide some more info on the type of query you are running?

The query being run is: get cf1['user-id']['seg']


3) ring imbalances during a repair (refer to the above nodetool ring output)
You may be seeing this
https://issues.apache.org/jira/browse/CASSANDRA-2280
I think it's a mistake that it is marked as resolved.

What can I do to confirm whether this issue is still outstanding and whether we are affected by it?

4) regular failure detection whenever a node does something only moderately stressful, such as a repair, or is under light load, even though the node itself thinks it is fine.
What version are you using?

Version of failure detection? I've not seen anything on this so I suspect this is the default.


Thanks,
Anton
