We have a two-DC cluster running Cassandra 1.2.9. The DCs are
physically separate, on opposite coasts of the US, not just logically
distinct. The primary use of this cluster is CL.ONE reads out of a
single column family. My expectation was that in such a scenario
restarts would have minimal impact in the DC where the restart
occurred, and no impact in the remote DC.
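For context, the read workload is essentially the following, issued
against the local DC (the keyspace and table names here are
hypothetical):

    -- in cqlsh
    CONSISTENCY ONE;
    SELECT * FROM my_keyspace.my_cf WHERE key = 'some-key';

With CL.ONE and replicas in each DC, the coordinator should normally
be able to satisfy the read entirely within its own DC.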
Instead, we are seeing that a restart in one DC has a dramatic impact
on performance in the other (let's call them DCs "A" and "B").
Test scenario on a node in DC "A" (the rough command sequence is
sketched after this list):
* disablegossip: no change
* drain: no change
* stop node: no change
* start node again: large increase in latency in both DCs A *and* B
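For reference, this is approximately what was run; the nodetool
commands are exact, but the service start/stop lines are placeholders
that depend on packaging:

    # on the node in DC A
    nodetool disablegossip        # node leaves gossip: no latency change
    nodetool drain                # flush memtables, stop accepting requests: no change
    sudo service cassandra stop   # process down: no change
    sudo service cassandra start  # latency spikes in both DCs A and B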
This graph shows the increase in read latency
(org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile)
as measured from DC *B*: http://i.imgur.com/OkIQyXI.png (actual
clients report numbers that agree with this server-side measurement).
Latency jumps by over an order of magnitude, taking us out of our
SLAs. (I would prefer that restarting not cause a latency spike in
either DC, but the one induced in the remote DC is particularly
concerning.)
However, the node that was restarted reports only a minor increase in
latency: http://i.imgur.com/KnGEJrE.png This is confusing from several
angles:
* I would not expect any cross-DC reads to be occurring normally.
* If there were cross-DC reads, they would take 50+ ms instead of the
< 5 ms normally reported.
* If the node that was restarted were still somehow involved in reads,
its own reporting shows it can account for only a small fraction of
the latency increase.
Some possibly relevant configuration (shown concretely below):
* GossipingPropertyFileSnitch
* dynamic_snitch_update_interval_in_ms: 100
* dynamic_snitch_reset_interval_in_ms: 600000
* dynamic_snitch_badness_threshold: 0.1
* read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (the
same type of behavior was observed with just read_repair_chance=0.1)
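To be concrete about where these settings live: the snitch options
are in cassandra.yaml, and the read repair chances are
per-column-family properties (the table name below is hypothetical):

    # cassandra.yaml (excerpt)
    endpoint_snitch: GossipingPropertyFileSnitch
    dynamic_snitch_update_interval_in_ms: 100
    dynamic_snitch_reset_interval_in_ms: 600000
    dynamic_snitch_badness_threshold: 0.1

    -- CQL3, applied per column family
    ALTER TABLE my_keyspace.my_cf
      WITH read_repair_chance = 0.01
      AND dclocal_read_repair_chance = 0.1;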
Has anyone else observed similar behavior and found a way to limit it?
This seems like something that ought not to happen, but without
knowing why it is occurring I'm not sure how to stop it.