We'll solve #2890, and we should have done it sooner. That being said, a quick question: how do you do your inserts from the clients? Are you evenly distributing the inserts among the nodes, or are you always hitting the same coordinator?
Because, provided the nodes are correctly distributed on the ring, if you distribute the insert (increment) requests across the nodes (again, I'm talking about client requests), you "should" not see the behavior you observe.

--
Sylvain

On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne <dha...@gmx.3crowd.com> wrote:
> It was exactly due to 2890, and the fact that the first replica is always the
> one with the lowest value IP address. I patched cassandra to pick a random
> node out of the replica set in StorageProxy.java findSuitableEndpoint:
>
>     Random rng = new Random();
>
>     return endpoints.get(rng.nextInt(endpoints.size())); // instead of: return endpoints.get(0);
>
> Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the
> inserts/sec throughput.
>
> Here's the behavior I saw, where "disk work" refers to the ReplicateOnWrite
> load of a counter insert:
>
> One node will get RF/n of the disk work. Two nodes will always get 0 disk
> work.
>
> In a 3-node cluster, 1 node gets its disk hit really hard. You get the
> performance of a one-node cluster.
> In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you
> the performance of a ~2-node cluster.
> In a 10-node cluster, 1 node gets 30% of the disk work, giving you the
> performance of a ~3-node cluster.
>
> I confirmed this behavior with 3-, 4-, and 5-node cluster sizes.
>
>>> On another note, on a 5-node cluster, I'm only seeing 3 nodes with
>>> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that
>>> normal? I'm using RandomPartitioner...
>>>
>>> Address    DC           Rack   Status  State   Load       Owns    Token
>>>                                                                   136112946768375385385349842972707284580
>>> 10.0.0.57  datacenter1  rack1  Up      Normal  2.26 GB    20.00%  0
>>> 10.0.0.56  datacenter1  rack1  Up      Normal  2.47 GB    20.00%  34028236692093846346337460743176821145
>>> 10.0.0.55  datacenter1  rack1  Up      Normal  2.52 GB    20.00%  68056473384187692692674921486353642290
>>> 10.0.0.54  datacenter1  rack1  Up      Normal  950.97 MB  20.00%  102084710076281539039012382229530463435
>>> 10.0.0.72  datacenter1  rack1  Up      Normal  383.25 MB  20.00%  136112946768375385385349842972707284580
>>>
>>> The nodes with ReplicateOnWrites are the 3 in the middle. The first node
>>> and last node both have a count of 0. This is a clean cluster, and I've
>>> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
>>> hours. The last time this test ran, it went all the way down to 500
>>> inserts/sec before I killed it.
>>
>> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
>>
>> --
>> Sylvain
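The one-line change David describes (return a random replica instead of the first, lowest-IP one) can be sketched in isolation as below. This is a hedged illustration, not the actual StorageProxy.findSuitableEndpoint code: the class, method signature, and String endpoints are simplified stand-ins for Cassandra's InetAddress-based internals, and production code would reuse a shared Random rather than allocating one per call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch of the replica-selection fix discussed in this thread (CASSANDRA-2890).
// "String" stands in for Cassandra's InetAddress replica entries.
public class SuitableEndpoint {
    private static final Random RNG = new Random();

    // Old behavior: always return the first replica. Since that is the
    // lowest-IP node, it absorbs all of the ReplicateOnWrite disk work.
    static String findSuitableEndpointOld(List<String> endpoints) {
        return endpoints.get(0);
    }

    // Patched behavior: pick a uniformly random replica, spreading the
    // ReplicateOnWrite load across the whole replica set.
    static String findSuitableEndpointPatched(List<String> endpoints) {
        return endpoints.get(RNG.nextInt(endpoints.size()));
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("10.0.0.54", "10.0.0.55", "10.0.0.56");
        System.out.println(findSuitableEndpointOld(replicas));               // always 10.0.0.54
        System.out.println(findSuitableEndpointPatched(replicas));           // any of the three
    }
}
```

This also matches the RF/n arithmetic quoted above: with RF=3, the single hot node does 3/3 = 100% of the disk work on a 3-node cluster, 3/6 = 50% on 6 nodes, and 3/10 = 30% on 10 nodes.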