You should consider upgrading to 0.7.6 to pick up a fix for Gossip. Earlier 0.7 releases were prone to incorrectly marking nodes down and then up again when they should not have been. See https://github.com/apache/cassandra/blob/cassandra-0.7/CHANGES.txt#L22
Are the TimedOutExceptions returned to the client for read or write requests? During the bursts, which stages are backing up in nodetool tpstats? Compaction should not affect writes too much (assuming the commit log and data files are on separate spindles).

You could also take a look at the read and write latency stats for a particular CF using nodetool cfstats or JConsole; these give you the stats for the local operations. It is also worth checking iostat on the box: http://spyced.blogspot.com/2010/01/linux-performance-basics.html
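For example (replace <host> with each node's address; 0.7 listens for JMX on port 8080 by default):

    # Thread pool stats: look for rising Pending counts in the read and
    # mutation stages during a burst.
    nodetool -h <host> -p 8080 tpstats

    # Per-CF read and write latency for local operations.
    nodetool -h <host> -p 8080 cfstats

    # Disk utilisation: watch await and %util on the data and commit log volumes.
    iostat -x 5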
Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 7 Jun 2011, at 00:30, David Boxenhorn wrote:

> Version 0.7.3.
>
> Yes, I am talking about minor compactions. I have three nodes, RF=3,
> and 3G of data (before replication). Not many users (yet). It seems
> like 3 nodes should be plenty. But when all 3 nodes are compacting, I
> sometimes get timeouts on the client, and I see in my logs that each
> one is full of notifications that the other nodes have died (and come
> back to life after about a second). My cluster can tolerate one node
> being out of commission, so I would rather have longer compactions one
> at a time than shorter compactions all at the same time.
>
> I think that our usage pattern of bursty writes causes the three nodes
> to decide to compact at the same time. These bursts are followed by
> periods of relative quiet, so there should be time for the other two
> nodes to compact one at a time.
>
> On Mon, Jun 6, 2011 at 3:27 PM, David Boxenhorn <da...@citypath.com> wrote:
>>
>> On Mon, Jun 6, 2011 at 2:36 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>>
>>> Are you talking about minor (automatic) compactions? Can you provide
>>> some more information on what is happening to make the node unusable,
>>> and what version you are using? It's not a lightweight process, but it
>>> should not hurt the node that badly. It is considered an online
>>> operation.
>>>
>>> Delaying compaction will only make it run for longer and take more
>>> resources.
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 6 Jun 2011, at 20:14, David Boxenhorn wrote:
>>>
>>>> Is there some deep architectural reason why compaction can't be
>>>> replication-aware?
>>>>
>>>> What I mean is, if one node is doing compaction, its replicas
>>>> shouldn't be doing compaction at the same time. Or, at the least, a
>>>> quorum of nodes should be available at all times.
>>>>
>>>> For example, if RF=3, and one node is doing compaction, the nodes to
>>>> its right and left in the ring should wait on compaction until that
>>>> node is done.
>>>>
>>>> Of course, my real problem is that compaction makes a node pretty
>>>> much unavailable. If we can fix that problem, then this is not
>>>> necessary.
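A minimal sketch of that idea (purely hypothetical; none of this is Cassandra code or API): with RF=3 and SimpleStrategy-style placement, a node's ranges are replicated on its immediate ring neighbours, so each node could defer a minor compaction while either neighbour is compacting:

    # Hypothetical sketch of "replication-aware" compaction scheduling (RF=3).
    # The ring list and compacting map stand in for gossip state; these
    # helpers are illustrative only and do not exist in Cassandra.

    def replica_neighbours(ring, node):
        """Ring neighbours that share replicas with `node` when RF=3."""
        i = ring.index(node)
        return [ring[(i - 1) % len(ring)], ring[(i + 1) % len(ring)]]

    def may_compact(ring, node, compacting):
        """Start a minor compaction only if no replica neighbour is compacting."""
        return not any(compacting[peer] for peer in replica_neighbours(ring, node))

    ring = ["A", "B", "C"]                            # three nodes, RF=3
    compacting = {"A": False, "B": True, "C": False}  # B is mid-compaction
    print([n for n in ring if may_compact(ring, n, compacting)])  # ['B']

With only three nodes and RF=3, every node is a replica of every other, so this rule serialises compactions across the whole cluster, which is the one-at-a-time behaviour asked for above.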