Interesting. First, just to make sure: since ReplicateOnWrite is for counters, you are using counters, right? (You use the word "insert" instead of "add/increment".)
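For reference, in Hector a counter write is an add/increment rather than an insert of a value. A minimal, untested sketch of the two kinds of call, assuming Hector 0.8.x; "Keyspace1", "Standard1", "Counter1" and the row/column names are placeholders, and the counter method names may differ slightly between Hector releases, so check against your client version:

    // Rough sketch, not tested: assumes Hector 0.8.x. Keyspace and column family
    // names are placeholders; counter method names may vary by Hector release.
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class InsertVsCounterAdd {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
            Keyspace keyspace = HFactory.createKeyspace("Keyspace1", cluster);
            Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

            // Regular column write ("insert") -- does not go through ReplicateOnWriteStage.
            mutator.insert("somerow", "Standard1",
                    HFactory.createStringColumn("somecolumn", "somevalue"));

            // Counter write -- an add/increment, not an insert of a value. These are
            // the writes that generate ReplicateOnWrite tasks.
            mutator.incrementCounter("somerow", "Counter1", "somecounter", 1L);
        }
    }

If your writer is doing the second kind of call, then we're definitely talking about counters.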
If you are using counters, supposedly the leader runs ReplicateOnWrite, so if somehow all your adds are picking the same box as leader, that's probably something worth investigating. Finally, you could try debugging into those ReplicateOnWrite tasks and see what they are doing exactly; maybe there is something immediately wrong (see the JMX sketch after the quoted tpstats output below).

On Mon, Jul 11, 2011 at 11:57 PM, David Hawthorne <dha...@gmx.3crowd.com> wrote:
> This is now my fourth attempt to get the message through. Apologies if you see multiple copies.
>
> I've tried to give as much relevant data as I can think of, but please let me know if you need any other info. I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want. I can also re-run the tests to replicate the problem if need be. As always, thanks for taking a look!
>
> I have a 4-host test cluster running 0.8.1 that I'm writing counters into, and I'm writing to it with hector at CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results). The partitioner is the RandomPartitioner, and RF=3. nodetool ring shows that the actual data is well balanced:
>
> Address      DC           Rack   Status  State   Load     Owns    Token
>                                                                   127605887595351923798765477786913079296
> 10.0.0.57    datacenter1  rack1  Up      Normal  1.59 GB  25.00%  0
> 10.0.0.56    datacenter1  rack1  Up      Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
> 10.0.0.55    datacenter1  rack1  Up      Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
> 10.0.0.54    datacenter1  rack1  Up      Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>
> I also tested against a 1-box test cluster with RF=1.
>
> Both show the same results: high performance for a while, and then ReplicateOnWrite backs way up (I've seen it as high as a million), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions. High performance in this case equates to about 5-20k inserts/sec on both the 4-box and 1-box clusters (I'm inserting one long row at a time, so the performance is bounded by the one box owning the row). The cassandra process never seems to recover, even if left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks.
>
> GC doesn't seem to be a factor. Logs show GCs completing in < 1s on all 4 boxes in the test cluster, as well as on the 1 standalone server.
>
> I'm also seeing highly erratic performance from the cluster in general: inserts/sec usually start at around 3k, then maybe creep up towards 6k, then drop back down to below 1k for an extended period of time; maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't last more than 30 seconds.
>
> Here's the tpstats output from all 4 boxes.
>
> Slowest/most problematic/highest cpu usage one (this tpstats output is also representative of the one-box cluster about 20 minutes into my insert run):
> 4 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending   Completed
> ReadStage                         0         0           0
> RequestResponseStage              0         0    59095840
> MutationStage                     0         0    31182942
> ReadRepairStage                   0         0           0
> ReplicateOnWriteStage            32    928042    28246716
> GossipStage                       1         1      449464
> AntiEntropyStage                  0         0           0
> MigrationStage                    0         0           4
> MemtablePostFlusher               0         0         445
> StreamStage                       0         0           0
> FlushWriter                       0         0         445
> FILEUTILS-DELETE-POOL             0         0          38
> MiscStage                         0         0           0
> FlushSorter                       0         0           0
> InternalResponseStage             0         0           4
> HintedHandoff                     0         0           0
>
> Second highest cpu usage one:
>
> 3 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending   Completed
> ReadStage                         0         0           0
> RequestResponseStage              0         0    27413910
> MutationStage                     0         0    27523094
> ReadRepairStage                   0         0           0
> ReplicateOnWriteStage             0         0    11111971
> GossipStage                       0         0      335422
> AntiEntropyStage                  0         0           0
> MigrationStage                    0         0           4
> MemtablePostFlusher               0         0          66
> StreamStage                       0         0           0
> FlushWriter                       0         0          66
> FILEUTILS-DELETE-POOL             0         0          31
> MiscStage                         0         0           0
> FlushSorter                       0         0           0
> InternalResponseStage             0         0           4
> HintedHandoff                     0         0           0
>
> The other 2 that see minimal cpu usage:
>
> 2 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending   Completed
> ReadStage                         0         0           0
> RequestResponseStage              0         0     9213999
> MutationStage                     0         0    33233114
> ReadRepairStage                   0         0           0
> ReplicateOnWriteStage             0         0           0
> GossipStage                       0         0      336016
> AntiEntropyStage                  0         0           0
> MigrationStage                    0         0           4
> MemtablePostFlusher               0         0          77
> StreamStage                       0         0           0
> FlushWriter                       0         0          77
> FILEUTILS-DELETE-POOL             0         0          40
> MiscStage                         0         0           0
> FlushSorter                       0         0           0
> InternalResponseStage             0         0           4
> HintedHandoff                     0         0           0
>
> 1 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending   Completed
> ReadStage                         0         0           0
> RequestResponseStage              0         0    13796875
> MutationStage                     0         0    36116662
> ReadRepairStage                   0         0           0
> ReplicateOnWriteStage             0         0           0
> GossipStage                       0         0      336496
> AntiEntropyStage                  0         0           0
> MigrationStage                    0         0           4
> MemtablePostFlusher               0         0          91
> StreamStage                       0         0           0
> FlushWriter                       0         0          91
> FILEUTILS-DELETE-POOL             0         0          43
> MiscStage                         0         0           0
> FlushSorter                       0         0           0
> InternalResponseStage             0         0           4
> HintedHandoff                     0         0           0
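P.S. On the "debugging into those ReplicateOnWrite tasks" suggestion above: you can also poll the same numbers tpstats prints over JMX and graph the backlog per node, which should make it obvious if one replica is absorbing all the counter leader work. A rough, untested Java sketch; the host, the JMX port (8080 is the 0.8 default) and the exact MBean ObjectName for the stage are my assumptions, so verify them in jconsole against your 0.8.1 nodes first:

    // Rough sketch, not tested: polls the ReplicateOnWriteStage thread pool over
    // JMX, i.e. the same Active/Pending/Completed numbers tpstats prints. The
    // host, port, and ObjectName are assumptions -- confirm them in jconsole.
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ReplicateOnWriteWatcher {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://10.0.0.57:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Assumed registration point for the stage's thread pool MBean.
            ObjectName stage = new ObjectName(
                    "org.apache.cassandra.request:type=ReplicateOnWriteStage");

            while (true) {
                System.out.printf("active=%s pending=%s completed=%s%n",
                        mbs.getAttribute(stage, "ActiveCount"),
                        mbs.getAttribute(stage, "PendingTasks"),
                        mbs.getAttribute(stage, "CompletedTasks"));
                Thread.sleep(5000); // poll every 5 seconds
            }
        }
    }

Running that against each of the 4 boxes (you already have jmxtrans wired up, so the same attributes should be available there too) would show whether the pending count only ever grows on the one node.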