On Tue, Jul 12, 2011 at 11:42 PM, David Hawthorne <dha...@gmx.3crowd.com> wrote:
> Well, I was using a large number of clients: I tried configuring a hector
> pool of 20-200 to see what effect that had on throughput. There's definitely
> a point after which there's no gain, so I dialed it back down. To clarify a
> few other things: when I say inserts I mean increments, as this test only
> involves counters. When I say a row is full, I mean I've written all of the
> data into it that belongs in it from our application's perspective. Not all
> rows are the same length, and the average number of subcolumns per
> supercolumn is higher for some rows than for others.
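(For reference, the kind of client-side loop being described above is roughly the following. This is only a sketch assuming Hector 0.8's counter API, Mutator#incrementCounter; the keyspace/CF names, host, and pool size are placeholders rather than values from this thread, and the real test writes counters inside supercolumns rather than a plain counter CF.)

    // Rough sketch of the client-side load described above, assuming Hector 0.8's
    // counter API (Mutator#incrementCounter). Names, host, and pool size are
    // placeholders; the actual test uses counters inside supercolumns.
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class CounterLoad {
        public static void main(String[] args) {
            CassandraHostConfigurator hosts = new CassandraHostConfigurator("10.0.0.54:9160");
            hosts.setMaxActive(20); // the "hector pool" size being tuned (20-200 in the test)
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", hosts);
            Keyspace ks = HFactory.createKeyspace("Stats", cluster);
            Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());

            // Fill one small set of rows completely before moving on to the next,
            // as described; all increments for a given row go to the same replica
            // set, which is why one box bounds the throughput.
            for (int row = 0; row < 100; row++) {
                for (int col = 0; col < 10000; col++) {
                    mutator.incrementCounter("row-" + row, "Counts", "col-" + col, 1L);
                }
            }
        }
    }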
Oh yes, you're using supercolumns. That could magnify the problem a bit, given
that to read any subcolumn of a supercolumn, we have to read the whole
supercolumn. It depends on how many counters per supercolumn we're talking
about.

> I'm curious, is it necessary for a read to happen for replication when
> using RF=1?

No, it's not necessary with RF=1 ... but we still do it :(. Ok, it's basically
a bug, we'll fix that for 0.8.2.

> I am currently running a test against a single node cluster, RF=1, CL.ONE,
> 4 connections/concurrent increments, and graphing various data points. If
> something interesting happens, I'll send a graph.
>
> When you say stagger writes, do you mean just lower the overall number of
> increments/sec my client is attempting,

Yes, this is what I mean.

> or do you mean I should stagger the load amongst many CFs/rows at once?
> Right now I'm writing heavily into a small set of rows, each going into
> 3 CFs, before moving on to the next small set of rows.
>
>
> On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:
>
>> When you do a counter increment at CL.ONE, the write is acknowledged as
>> soon as the first replica getting the write has pushed the increment into
>> its memtable. However, there is a read happening for the replication to
>> the other replicas (this is necessary for the counter design). What is
>> happening is that our nodes are not able to do those reads fast enough.
>> Or, another way to see it is that you are writing faster than your
>> cluster can handle.
>>
>> I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
>> something about that, but for now you'll have to stagger your writes.
>> I'm more surprised that tasks on the replicate on write stage would stack
>> up at QUORUM, unless you're using a very large number of clients.
>>
>> --
>> Sylvain
>>
>> On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne <dha...@gmx.3crowd.com>
>> wrote:
>>> This is now my fourth attempt to get the message through. Apologies if
>>> you see multiple copies.
>>>
>>> I've tried to give as much relevant data as I can think of, but please
>>> let me know if you need any other info. I spent the day getting jmxtrans
>>> to talk to statsd with the cassandra JMX data, so I can provide graphs
>>> of just about anything you want. I can also re-run the tests to
>>> replicate the problem if need be. As always, thanks for taking a look!
>>>
>>> I have a 4-host test cluster running 0.8.1 that I'm writing counters
>>> into with hector at CL.ONE or CL.QUORUM (the test was run with each, and
>>> both gave the same results). Partitioner is the RandomPartitioner, and
>>> RF=3. nodetool ring shows that the actual data is well balanced:
>>>
>>> Address    DC           Rack   Status  State   Load     Owns    Token
>>>                                                                  127605887595351923798765477786913079296
>>> 10.0.0.57  datacenter1  rack1  Up      Normal  1.59 GB  25.00%  0
>>> 10.0.0.56  datacenter1  rack1  Up      Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
>>> 10.0.0.55  datacenter1  rack1  Up      Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
>>> 10.0.0.54  datacenter1  rack1  Up      Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>>>
>>> I also tested against a 1-box test cluster with RF=1.
>>>
>>> Both show the same results: high performance for a while, and then
>>> ReplicateOnWrite backs way up (I've seen it as high as a million), the
>>> cassandra process becomes unresponsive, and the hector client starts
>>> throwing exceptions.
>>> High performance in this case equates to about 5-20k inserts/sec on both
>>> the 4-box and the 1-box cluster (I'm inserting one long row at a time, so
>>> the performance is bounded by the one box owning the row). The cassandra
>>> process never seems to recover, even if left for 24 hours. It's still
>>> chewing through those ReplicateOnWrite pending tasks.
>>>
>>> GC doesn't seem to be a factor. Logs show GCs completing in < 1s on all
>>> 4 boxes in the test cluster, as well as on the 1 standalone server.
>>>
>>> I'm also seeing highly erratic performance from the cluster in general:
>>> inserts/sec usually start at around 3k, but then maybe creep up towards
>>> 6k, then drop back down to below 1k for an extended period of time, then
>>> maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't
>>> last more than 30 seconds.
>>>
>>> Here's the tpstats output from all 4 boxes:
>>>
>>> Slowest/most problematic/highest cpu usage one:
>>>
>>> This tpstats output is also representative of the one-box cluster about
>>> 20 minutes into my insert run.
>>>
>>> 4 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0    59095840
>>> MutationStage                 0         0    31182942
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage        32    928042    28246716
>>> GossipStage                   1         1      449464
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0         445
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0         445
>>> FILEUTILS-DELETE-POOL         0         0          38
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
>>>
>>> Second highest cpu usage one:
>>>
>>> 3 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0    27413910
>>> MutationStage                 0         0    27523094
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage         0         0    11111971
>>> GossipStage                   0         0      335422
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0          66
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0          66
>>> FILEUTILS-DELETE-POOL         0         0          31
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
>>>
>>> The other 2 that see minimal cpu usage:
>>>
>>> 2 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0     9213999
>>> MutationStage                 0         0    33233114
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage         0         0           0
>>> GossipStage                   0         0      336016
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0          77
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0          77
>>> FILEUTILS-DELETE-POOL         0         0          40
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
>>>
>>> 1 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0    13796875
>>> MutationStage                 0         0    36116662
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage         0         0           0
>>> GossipStage                   0         0      336496
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0          91
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0          91
>>> FILEUTILS-DELETE-POOL         0         0          43
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
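To stagger the writes on the client side, the simplest thing is to cap the number of increments/sec instead of pushing as fast as the connection pool allows. A minimal sketch of such a pacer (plain Java; the target rate and the doIncrement() call in the usage comment are placeholders for whatever your client actually does):

    // Minimal fixed-rate pacer to "stagger" the writes: cap increments/sec on the
    // client instead of writing as fast as the pool allows. The target rate and
    // doIncrement() below are placeholders.
    public class Throttle {
        private final long nanosPerOp;
        private long next = System.nanoTime();

        public Throttle(int opsPerSecond) {
            this.nanosPerOp = 1000000000L / opsPerSecond;
        }

        public synchronized void acquire() throws InterruptedException {
            long now = System.nanoTime();
            if (next > now) {
                Thread.sleep((next - now) / 1000000L); // wait until our slot comes up
            }
            next = Math.max(now, next) + nanosPerOp;   // schedule the next slot
        }
    }

    // Usage (one throttle shared by all writer threads):
    //   Throttle throttle = new Throttle(2000);       // e.g. cap at 2k increments/sec
    //   while (moreWork()) { throttle.acquire(); doIncrement(); }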
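And since you already have jmxtrans talking to statsd: the numbers tpstats prints come from the per-stage thread pool MBeans, so you can graph the ReplicateOnWrite backlog directly. A quick polling sketch; I believe the MBean on 0.8 is org.apache.cassandra.request:type=ReplicateOnWriteStage and the default JMX port is 7199, but verify both in jconsole first:

    // Quick JMX poll of the ReplicateOnWrite backlog (same numbers as tpstats).
    // The MBean name and port 7199 are what I believe 0.8 uses; check in jconsole.
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RowPending {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName row = new ObjectName("org.apache.cassandra.request:type=ReplicateOnWriteStage");
            while (true) {
                System.out.println("ReplicateOnWrite pending = " + mbs.getAttribute(row, "PendingTasks")
                    + ", active = " + mbs.getAttribute(row, "ActiveCount"));
                Thread.sleep(5000);
            }
        }
    }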