We have a two-DC cluster running Cassandra 1.2.9. The DCs are physically separate, on opposite coasts of the US, not just logical ones. The primary use of this cluster is CL.ONE reads out of a single column family. My expectation was that in such a scenario a restart would have minimal impact in the DC where it occurred, and no impact in the remote DC.

We are seeing instead that restarts in one DC have a dramatic impact on performance in the other (let's call them DCs "A" and "B").

Test scenario on a node in DC "A":
 * disablegossip: no change
 * drain: no change
 * stop node: no change
 * start node again: Large increase in latency in both DCs A *and* B

This is a graph showing the increase in latency (org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile) from DC *B*: http://i.imgur.com/OkIQyXI.png (Actual clients report similar numbers that agree with this server-side measurement.) Latency jumps by over an order of magnitude and goes out of SLA. (I would prefer restarting to not cause a latency spike in either DC, but the one induced in the remote DC is particularly concerning.)

However, the node that was restarted reports only a minor increase in latency: http://i.imgur.com/KnGEJrE.png This is confusing from several different angles:
 * I would not expect any cross-dc reads to normally be occurring
 * If there were cross-DC reads, they would take 50+ ms instead of the < 5 ms normally reported
 * If the node that was restarted was still somehow involved in reads, its own reporting shows it can only account for a small amount of the latency increase
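
To put rough numbers on the second point, here is a minimal sketch of the percentile arithmetic. It is only a toy model, not a measurement: it assumes every read takes either a flat 5 ms (local) or a flat 50 ms (cross-DC), and the function name and the 1000-read sample size are made up for illustration.

```python
def p95_latency_ms(cross_dc_reads, total_reads=1000, local_ms=5.0, cross_dc_ms=50.0):
    """Nearest-rank 95th percentile for a mix of local and cross-DC reads.

    Toy model (assumption): each read costs exactly local_ms or cross_dc_ms.
    """
    local_reads = total_reads - cross_dc_reads
    samples = sorted([local_ms] * local_reads + [cross_dc_ms] * cross_dc_reads)
    # Nearest-rank method: rank = ceil(0.95 * n), computed with integers.
    rank = (95 * total_reads + 99) // 100
    return samples[rank - 1]


for cross_dc_reads in (0, 10, 50, 51, 100):
    print(cross_dc_reads, "cross-DC reads per 1000 -> p95 =",
          p95_latency_ms(cross_dc_reads), "ms")
```

Under these assumptions the 95th percentile stays at the local figure until more than 5% of reads are slow, and then jumps straight to the cross-DC figure. So in this model a 1% rate of slow reads would not move the p95 on its own, while anything above 5% would produce an order-of-magnitude jump.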

Some possible relevant configurations:
 * GossipingPropertyFileSnitch
 * dynamic_snitch_update_interval_in_ms: 100
 * dynamic_snitch_reset_interval_in_ms: 600000
 * dynamic_snitch_badness_threshold: 0.1
 * read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (same type of behavior was observed with just read_repair_chance=0.1)

Has anyone else observed similar behavior and found a way to limit it? This seems like something that ought not to happen but without knowing why it is occurring I'm not sure how to stop it.
