Is there a reason you are using the trunk and not one of the tagged
releases? Official releases are a lot more stable than the trunk.
Yes. As we are using a combination of EC2 and colo servers, we need to
use broadcast_address from CASSANDRA-2491. The patch associated with
that JIRA does not apply cleanly against 0.8, which is why we are using
trunk.
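For reference, the intent is roughly the following on an EC2 node
(illustrative addresses only, and assuming the broadcast_address option
as it appears in the CASSANDRA-2491 patch; property names may differ in
other builds):

    # cassandra.yaml (sketch, not our exact config)
    listen_address: 10.0.1.12         # private IP the node binds to inside EC2
    broadcast_address: 203.0.113.12   # public IP gossiped to the colo nodes

The idea is that the colo DC reaches the EC2 nodes via the broadcast
(public) address while local traffic stays on the private interface.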
1) thrift timeouts & general degraded response times
For reads or writes? What sort of queries are you running? Check the
local latency on each node using cfstats and cfhistograms, and a bit of
iostat (see
http://spyced.blogspot.com/2010/01/linux-performance-basics.html). What
does nodetool tpstats say? Is there a stage backing up?
If the local latency is OK, look at the cross-DC situation. What CL are
you using? Are nodes timing out waiting for nodes in other DCs?
iostat doesn't show a request queue bottleneck. The timeouts we are
seeing are for reads. The latency on the nodes I have temporarily used
for reads is around 2-45ms. The next token in the ring, at an alternate
DC, is showing ~4ms, with everything else around 0.05ms. tpstats
doesn't show anything active/pending. Reads are at CL.ONE and writes
use CL.ANY.
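For reference, those checks were run along these lines (standard
nodetool/iostat invocations, flags from memory; host and keyspace names
are placeholders):

    nodetool -h <node> cfstats
    nodetool -h <node> cfhistograms <keyspace> <column_family>
    nodetool -h <node> tpstats
    iostat -x 2 10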
2) *lots* of exception errors, such as:
Repair is trying to run on a response that is a digest response; this
should not be happening. Can you provide some more info on the type of
query you are running?
The query being run is get cf1['user-id']['seg']
3) ring imbalances during a repair (refer to the above nodetool ring
output)
You may be seeing this:
https://issues.apache.org/jira/browse/CASSANDRA-2280
I think it's a mistake that it is marked as resolved.
What can I do to confirm that this issue is still outstanding and/or
that we are affected by it?
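The only thing I can think of so far is to snapshot the ring before,
during and after a repair and check whether streaming and validation
compactions account for the growth, e.g. something along these lines
(subcommand names from memory, they may differ between versions):

    nodetool -h <node> ring              # compare the reported Load at each stage
    nodetool -h <node> netstats          # any repair streams still in flight?
    nodetool -h <node> compactionstats   # validation/compaction backlog?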
4) regular failure detection when any node does something only
moderately stressful, such as a repair, or is under light load, etc.,
but the node itself thinks it is fine.
What version are you using?
Version of failure detection? I've not seen anything on this, so I
suspect this is the default.
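If it helps, the only failure-detector setting I know of is
phi_convict_threshold in cassandra.yaml, which we have not changed, so
it should still be at its default (I believe 8):

    # cassandra.yaml - failure detector sensitivity; higher values make the
    # detector slower to mark a node down (often raised a little on EC2)
    phi_convict_threshold: 8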
Thanks,
Anton