It was exactly due to CASSANDRA-2890, and the fact that the first replica is 
always the one with the lowest-valued IP address.  I patched Cassandra to pick 
a random node out of the replica set in StorageProxy.java's findSuitableEndpoint:

Random rng = new Random();

// instead of: return endpoints.get(0);
return endpoints.get(rng.nextInt(endpoints.size()));
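
For context, here's roughly where that change sits.  The surrounding method 
signature and the getLiveNaturalEndpoints() lookup are reconstructed from 
memory of the 0.8-era StorageProxy, so treat this as a sketch rather than a 
verbatim diff.  I'd also hoist the Random into a static field so it isn't 
allocated on every request:

// Sketch only: signature and endpoint lookup are assumptions from memory
// of 0.8-era StorageProxy.java; check them against your source tree.
private static final Random random = new Random();

private static InetAddress findSuitableEndpoint(String table, ByteBuffer key) throws UnavailableException
{
    List<InetAddress> endpoints = StorageService.instance.getLiveNaturalEndpoints(table, key);
    if (endpoints.isEmpty())
        throw new UnavailableException();

    // was: return endpoints.get(0);  (always the lowest-IP replica)
    return endpoints.get(random.nextInt(endpoints.size()));
}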

Now work load is evenly balanced among all 5 nodes and I'm getting 2.5x the 
inserts/sec throughput.

Here's the behavior I saw, and "disk work" refers to the ReplicateOnWrite load 
of a counter insert:

One node will get RF/n of the disk work.  Two nodes will always get 0 disk work.

In a 3-node cluster, 1 node gets its disk hit really hard, so you get the 
performance of a one-node cluster.
In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you 
the performance of a ~2-node cluster.
In a 10-node cluster, 1 node gets 30% of the disk work, giving you the 
performance of a ~3-node cluster.

I confirmed this behavior with 3-, 4-, and 5-node cluster sizes.
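
To make the back-of-envelope math explicit: if the hottest node takes RF/n of 
the ReplicateOnWrite work and becomes the bottleneck, the cluster only delivers 
about n/RF nodes' worth of throughput.  A throwaway sketch of that arithmetic, 
assuming RF=3 as in the numbers above:

public class HotNodeMath
{
    public static void main(String[] args)
    {
        int rf = 3;  // replication factor assumed for the examples above
        for (int n : new int[] { 3, 6, 10 })
        {
            double hotShare = (double) rf / n;  // fraction of ReplicateOnWrite on the hottest node
            double effective = 1.0 / hotShare;  // = n / rf, the cluster size you effectively get
            System.out.printf("n=%2d: hot node does %3.0f%% of disk work -> ~%.1f-node cluster%n",
                              n, hotShare * 100, effective);
        }
    }
}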


> 
>> On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
>> ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
>> normal?  I'm using RandomPartitioner...
>> 
>> Address      DC          Rack   Status State   Load       Owns     Token
>>                                                                     136112946768375385385349842972707284580
>> 10.0.0.57    datacenter1 rack1  Up     Normal  2.26 GB    20.00%   0
>> 10.0.0.56    datacenter1 rack1  Up     Normal  2.47 GB    20.00%   34028236692093846346337460743176821145
>> 10.0.0.55    datacenter1 rack1  Up     Normal  2.52 GB    20.00%   68056473384187692692674921486353642290
>> 10.0.0.54    datacenter1 rack1  Up     Normal  950.97 MB  20.00%   102084710076281539039012382229530463435
>> 10.0.0.72    datacenter1 rack1  Up     Normal  383.25 MB  20.00%   136112946768375385385349842972707284580
>> 
>> The nodes with ReplicateOnWrites are the 3 in the middle.  The first node 
>> and last node both have a count of 0.  This is a clean cluster, and I've 
>> been doing 3k ... 2.5k (decaying performance) inserts/sec for the last 12 
>> hours.  The last time this test ran, it went all the way down to 500 
>> inserts/sec before I killed it.
> 
> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
> 
> --
> Sylvain
