On Tue, Apr 12, 2011 at 12:26 AM, aaron morton <aa...@thelastpickle.com> wrote:
> The reset interval clears the latency tracked for each node so a bad node
> will be read from again. The scores for each node are then updated every
> 100ms (default) using the last 100 responses from a node.
>
> How long does the bad performance last for?

Only a few seconds, but there are a lot of read requests during this time.

> What CL are you reading at? At Quorum with RF 4 the read request will be
> sent to 3 nodes, ordered by proximity and wellness according to the dynamic
> snitch. (For background, see the recent discussion on the dynamic snitch:
> http://www.mail-archive.com/user@cassandra.apache.org/msg12089.html)

I am reading with CL of ONE, read_repair_chance=0.33, RackInferringSnitch
and keys_cached = rows_cached = 0.

> You can take a look at the weights and timings used by the DynamicSnitch in
> JConsole under o.a.c.db.DynamicEndpointSnitch. Also, at DEBUG log level you
> will be able to see which nodes the request is sent to.

Everything looks OK. The weights are around 3 for the nodes in the same
data center and around 5 for the others. I will turn on the DEBUG level to
see if I can find more info.

> My guess is the DynamicSnitch is doing the right thing and the slowdown is
> a node with a problem getting back into the list of nodes used for your
> read. It's then moved down the list as its bad performance is noticed.

Looking at the DynamicSnitch MBean I don't see any problems with any of the
nodes. My guess is that during the reset time there are reads that are sent
to the other data center.

> Hope that helps
> Aaron

Shimi

> On 12 Apr 2011, at 01:28, shimi wrote:
>
> I finally upgraded 0.6.x to 0.7.4. The nodes have been running the new
> version for several days across 2 data centers.
> I noticed that the read time on some of the nodes increases by a factor
> of 50-60 every ten minutes.
> There was no indication in the logs of anything happening at the same
> time. The only thing I know of that runs every 10 minutes is
> the dynamic snitch reset.
> So I changed dynamic_snitch_reset_interval_in_ms to 20 minutes and now I
> have the problem once every 20 minutes.
>
> I am running all nodes with:
>   replica_placement_strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>   strategy_options:
>     DC1 : 2
>     DC2 : 2
>   replication_factor: 4
>
> (DC1 and DC2 are taken from the IPs)
> Is anyone familiar with this kind of behavior?
>
> Shimi
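
The mechanism discussed in this thread (a per-node window of the last 100 response latencies that is cleared at each reset) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Cassandra's snitch code: the class `LatencyScores`, the node names, the latency values, and the rule that a node with no samples scores 0.0 are all hypothetical. Under those assumptions it shows how a remote-data-center node could briefly be ranked first right after a reset, which is the guess made above. Whether 0.7.4 actually behaves this way is not confirmed by the thread.

```python
from collections import deque
from statistics import mean

WINDOW_SIZE = 100  # "last 100 responses", as described in the thread


class LatencyScores:
    """Toy per-node latency tracker (hypothetical, not Cassandra's code)."""

    def __init__(self, window_size=WINDOW_SIZE):
        self.window_size = window_size
        self.samples = {}  # node name -> deque of recent latencies in ms

    def record(self, node, latency_ms):
        # A bounded deque silently drops the oldest sample once it is full.
        window = self.samples.setdefault(node, deque(maxlen=self.window_size))
        window.append(latency_ms)

    def score(self, node):
        # Assumption: a node with no samples scores 0.0 (looks ideal).
        window = self.samples.get(node)
        return mean(window) if window else 0.0

    def rank(self, nodes):
        # Lower score first; sorted() is stable, so ties keep the input order.
        return sorted(nodes, key=self.score)

    def reset(self):
        # Model of the periodic reset: forget every tracked latency.
        self.samples.clear()


scores = LatencyScores()
for _ in range(150):
    scores.record("dc1-a", 2.0)   # local node, fast
    scores.record("dc1-b", 3.0)   # local node, slightly slower
    scores.record("dc2-a", 50.0)  # remote data center, slow

nodes = ["dc2-a", "dc1-b", "dc1-a"]
before_reset = scores.rank(nodes)  # local nodes are preferred
scores.reset()
after_reset = scores.rank(nodes)   # all scores tie, input order wins

print(before_reset)
print(after_reset)
```

In this model the remote node stays at the front of the list until enough new samples arrive to push it back down, which would match a slowdown lasting only a few seconds after each reset.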