It was exactly due to 2890, and the fact that the first replica is always the one with the lowest IP address. I patched Cassandra to pick a random node out of the replica set in StorageProxy.java's findSuitableEndpoint:
    Random rng = new Random();
    return endpoints.get(rng.nextInt(endpoints.size()));
    // instead of: return endpoints.get(0);

Now the workload is evenly balanced among all 5 nodes, and I'm getting 2.5x the inserts/sec throughput.

Here's the behavior I saw ("disk work" refers to the ReplicateOnWrite load of a counter insert): one node gets RF/n of the disk work, and two nodes always get 0 disk work.

- In a 3-node cluster, one node's disk gets hit really hard (with RF=3, that's 3/3 = 100% of the work), so you get the performance of a one-node cluster.
- In a 6-node cluster, one node gets hit with 50% of the disk work (3/6), giving you the performance of a ~2-node cluster.
- In a 10-node cluster, one node gets 30% of the disk work (3/10), giving you the performance of a ~3-node cluster.

I confirmed this behavior with 3-, 4-, and 5-node clusters.

>> On another note, on a 5-node cluster, I'm only seeing 3 nodes with
>> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that
>> normal? I'm using RandomPartitioner...
>>
>> Address    DC          Rack   Status  State   Load       Owns    Token
>>                                                                  136112946768375385385349842972707284580
>> 10.0.0.57  datacenter1 rack1  Up      Normal  2.26 GB    20.00%  0
>> 10.0.0.56  datacenter1 rack1  Up      Normal  2.47 GB    20.00%  34028236692093846346337460743176821145
>> 10.0.0.55  datacenter1 rack1  Up      Normal  2.52 GB    20.00%  68056473384187692692674921486353642290
>> 10.0.0.54  datacenter1 rack1  Up      Normal  950.97 MB  20.00%  102084710076281539039012382229530463435
>> 10.0.0.72  datacenter1 rack1  Up      Normal  383.25 MB  20.00%  136112946768375385385349842972707284580
>>
>> The nodes with ReplicateOnWrites are the 3 in the middle. The first node
>> and last node both have a count of 0. This is a clean cluster, and I've
>> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12
>> hours. The last time this test ran, it went all the way down to 500
>> inserts/sec before I killed it.
>
> Could be due to https://issues.apache.org/jira/browse/CASSANDRA-2890.
>
> --
> Sylvain
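PS: for anyone who wants to try the same workaround, here's a minimal, self-contained sketch of the before/after selection logic. Only the two changed lines shown above come from my actual patch; the class scaffolding and method signature are reconstructed for illustration and won't match StorageProxy.java exactly:

    import java.net.InetAddress;
    import java.util.List;
    import java.util.Random;

    // Sketch only -- not the verbatim Cassandra source. The method name
    // follows StorageProxy.findSuitableEndpoint; everything else here is
    // scaffolding so the example compiles on its own.
    public class EndpointSelection
    {
        // One shared instance; a new Random per call (as in the patch
        // above) also works, but is needlessly expensive.
        private static final Random RANDOM = new Random();

        // Before: always the first replica in the list. Replicas are
        // ordered so the lowest-IP node comes first, so that node ends
        // up coordinating every counter write for its ranges (all the
        // ReplicateOnWrite work).
        public static InetAddress findSuitableEndpointBefore(List<InetAddress> endpoints)
        {
            return endpoints.get(0);
        }

        // After: a uniformly random replica, so the ReplicateOnWrite
        // load spreads across the whole replica set.
        public static InetAddress findSuitableEndpointAfter(List<InetAddress> endpoints)
        {
            return endpoints.get(RANDOM.nextInt(endpoints.size()));
        }
    }

Picking a replica uniformly at random means each of the RF replicas coordinates ~1/RF of the counter writes for its ranges, so the hottest node drops from RF/n of the total disk work to ~1/n.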