They are evenly distributed: 5 nodes * 40 connections each using hector, and I can confirm that all 200 were active when this happened (from hector's perspective, from graphing the hector JMX data). All 5 nodes saw roughly 40 connections, and all were receiving traffic over those connections (netstat + ntop + trafshow, etc).

I can also confirm that I changed my insert strategy to break up the rows using composite row keys (rough sketch of what I mean below), which reduced the row lengths and gave me an almost perfectly even data distribution among the nodes. That was when I started to really dig into why the ROWs (ReplicateOnWrite tasks) were still backing up on one node specifically, and why 2 nodes weren't seeing any. It was the 20% / 20% / 60% ROW distribution that really got me thinking, and when I took the 60% node out of the cluster, that ROW load jumped to the node with the next-lowest IP address, and the 2 nodes that weren't seeing any *still* weren't seeing any ROWs. At that point I tore down the cluster and recreated it as a 3-node cluster several times, using various permutations of the 5 nodes available, and the ROW load was *always* on the node with the lowest IP address.
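
Rough sketch of the composite-key bucketing I mean; the key layout and bucket count are just illustrative, not my exact schema:

  // Illustrative only: spread one logical wide row across several physical
  // rows by appending a bucket suffix to the row key, so the data (and the
  // write load) fans out across the ring instead of piling into one row.
  public final class BucketedKeys {
      private static final int NUM_BUCKETS = 16;  // assumed bucket count

      // Derive the bucket from the column name so a reader can recompute
      // which physical row a given column landed in.
      public static String bucketedRowKey(String logicalKey, String columnName) {
          int bucket = Math.abs(columnName.hashCode() % NUM_BUCKETS);
          return logicalKey + ":" + bucket;   // e.g. "hits:2011-09-08" -> "hits:2011-09-08:7"
      }
  }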

The theory might not be right, but it certainly represents the behavior I saw.
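
For anyone skimming past the quotes, the patch I describe below boils down to the following. This is only a rough sketch; the method signature is approximated from memory, and only the return statement reflects the actual change in StorageProxy.findSuitableEndpoint:

  import java.net.InetAddress;
  import java.util.List;
  import java.util.Random;

  public final class PickRandomReplica {
      private static final Random RNG = new Random();

      // Instead of always returning the first replica (endpoints.get(0)),
      // which funnels every counter's ReplicateOnWrite onto the same node,
      // pick a random replica so the work spreads across the replica set.
      public static InetAddress findSuitableEndpoint(List<InetAddress> endpoints) {
          // was: return endpoints.get(0);
          return endpoints.get(RNG.nextInt(endpoints.size()));
      }
  }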


On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:

> We'll solve #2890, and we should have done it sooner.
> 
> That being said, a quick question: how do you do your inserts from the
> clients? Are you evenly distributing the inserts among the nodes, or are
> you always hitting the same coordinator?
> 
> Because provided the nodes are correctly distributed on the ring, if you
> distribute the insert (increment) requests across the nodes (again, I'm
> talking about client requests), you "should" not see the behavior you
> observe.
> 
> --
> Sylvain
> 
> On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne <dha...@gmx.3crowd.com> wrote:
>> It was exactly due to CASSANDRA-2890, and the fact that the first replica is
>> always the one with the lowest-valued IP address.  I patched Cassandra to pick
>> a random node out of the replica set in StorageProxy.java's findSuitableEndpoint:
>> 
>> Random rng = new Random();
>> return endpoints.get(rng.nextInt(endpoints.size()));  // instead of: return endpoints.get(0);
>> 
>> Now the workload is evenly balanced among all 5 nodes, and I'm getting 2.5x
>> the inserts/sec throughput.
>> 
>> Here's the behavior I saw, where "disk work" refers to the ReplicateOnWrite
>> load of a counter insert:
>> 
>> One node will get RF/n of the disk work.  Two nodes will always get 0 disk 
>> work.
>> 
>> In a 3-node cluster, 1 node gets hit really hard on disk, giving you the
>> performance of a one-node cluster.
>> In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you
>> the performance of a ~2-node cluster.
>> In a 10-node cluster, 1 node gets 30% of the disk work, giving you the
>> performance of a ~3-node cluster.
>> 
>> I confirmed this behavior with a 3, 4, and 5 node cluster size.
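>> 
>> To spell out the arithmetic behind those numbers (assuming RF=3, which is the
>> only value that makes the percentages above work out):
>> 
>> // the "leader" node's share of the cluster's ReplicateOnWrite disk work
>> // is RF / n, so the cluster behaves like roughly n / RF nodes once that
>> // node becomes the bottleneck.
>> static double leaderShare(int rf, int n) { return (double) rf / n; }
>> // leaderShare(3, 3)  = 1.0  -> behaves like 1 node
>> // leaderShare(3, 6)  = 0.5  -> behaves like ~2 nodes
>> // leaderShare(3, 10) = 0.3  -> behaves like ~3 nodes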
>> 
>> 
>>> 
>>>> On another note, on a 5-node cluster, I'm only seeing 3 nodes with 
>>>> ReplicateOnWrite Completed tasks in nodetool tpstats output.  Is that 
>>>> normal?  I'm using RandomPartitioner...
>>>> 
>>>> Address      DC           Rack   Status  State   Load       Owns    Token
>>>>                                                                     136112946768375385385349842972707284580
>>>> 10.0.0.57    datacenter1  rack1  Up      Normal  2.26 GB    20.00%  0
>>>> 10.0.0.56    datacenter1  rack1  Up      Normal  2.47 GB    20.00%  34028236692093846346337460743176821145
>>>> 10.0.0.55    datacenter1  rack1  Up      Normal  2.52 GB    20.00%  68056473384187692692674921486353642290
>>>> 10.0.0.54    datacenter1  rack1  Up      Normal  950.97 MB  20.00%  102084710076281539039012382229530463435
>>>> 10.0.0.72    datacenter1  rack1  Up      Normal  383.25 MB  20.00%  136112946768375385385349842972707284580
>>>> 
>>>> The nodes with ReplicateOnWrites are the 3 in the middle; the first node
>>>> and last node both have a count of 0.  This is a clean cluster, and I've
>>>> been doing 3k inserts/sec, decaying to 2.5k, for the last 12 hours.  The
>>>> last time this test ran, it went all the way down to 500 inserts/sec before
>>>> I killed it.
>>> 
>>> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
>>> 
>>> --
>>> Sylvain
>> 
>> 
