We have a two-DC cluster running Cassandra 1.2.9. The DCs are
physically separate, on opposite coasts of the US, not just logically
distinct. The primary use of this cluster is CL.ONE reads out of a
single column family. My expectation was that in such a scenario
restarts would have minimal impact in the DC where the restart
occurred, and no impact in the remote DC.
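For context, the read workload is essentially the following, issued
against the local DC (the keyspace and table names here are
hypothetical):

    -- in cqlsh
    CONSISTENCY ONE;
    SELECT * FROM my_keyspace.my_cf WHERE key = 'some-key';

With CL.ONE and replicas in each DC, the coordinator should normally
be able to satisfy the read entirely within its own DC.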
Instead, we are seeing that a restart in one DC has a dramatic impact
on performance in the other (let's call them DCs "A" and "B").
Test scenario on a node in DC "A" (the rough command sequence is
sketched after this list):
* disablegossip: no change
* drain: no change
* stop node: no change
* start node again: large increase in latency in both DCs A *and* B
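For reference, this is approximately what was run; the nodetool
commands are exact, but the service start/stop lines are placeholders
that depend on packaging:

    # on the node in DC A
    nodetool disablegossip        # node leaves gossip: no latency change
    nodetool drain                # flush memtables, stop accepting requests: no change
    sudo service cassandra stop   # process down: no change
    sudo service cassandra start  # latency spikes in both DCs A and B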
This graph shows the increase in read latency
(org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile)
as measured from DC *B*: http://i.imgur.com/OkIQyXI.png (actual
clients report numbers that agree with this server-side measurement).
Latency jumps by over an order of magnitude, taking us out of our
SLAs. (I would prefer that restarting not cause a latency spike in
either DC, but the one induced in the remote DC is particularly
concerning.)
However, the node that was restarted reports only a minor increase in
latency: http://i.imgur.com/KnGEJrE.png This is confusing from several
angles:
* I would not expect any cross-DC reads to be occurring normally.
* If there were cross-DC reads, they would take 50+ ms instead of the
< 5 ms normally reported.
* If the node that was restarted were still somehow involved in reads,
its own reporting shows it can account for only a small fraction of
the latency increase.
Some possibly relevant configuration (shown concretely below):
* GossipingPropertyFileSnitch
* dynamic_snitch_update_interval_in_ms: 100
* dynamic_snitch_reset_interval_in_ms: 600000
* dynamic_snitch_badness_threshold: 0.1
* read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (the
same type of behavior was observed with just read_repair_chance=0.1)
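To be concrete about where these settings live: the snitch options
are in cassandra.yaml, and the read repair chances are
per-column-family properties (the table name below is hypothetical):

    # cassandra.yaml (excerpt)
    endpoint_snitch: GossipingPropertyFileSnitch
    dynamic_snitch_update_interval_in_ms: 100
    dynamic_snitch_reset_interval_in_ms: 600000
    dynamic_snitch_badness_threshold: 0.1

    -- CQL3, applied per column family
    ALTER TABLE my_keyspace.my_cf
      WITH read_repair_chance = 0.01
      AND dclocal_read_repair_chance = 0.1;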
Has anyone else observed similar behavior and found a way to limit it?
This seems like something that ought not to happen, but without
knowing why it is occurring I'm not sure how to stop it.