Cool.
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 11 Aug 2011, at 02:45, Mina Naguib wrote:
>
> Hi Aaron
>
> Thank you very much for the reply and the pointers to the previous list
> discussions. The second was particularly telling.
>
> I'm happy to say that the problem is fixed, and it's so trivial it's quite
> embarrassing - but I'll state it here for the sake of the archives.
>
> There was an extra colon in the topology file in the line defining IPLA3.
> It's just as visible in my prod config as it is in my example below ;-)
>
> I'm guessing the parser splits <dc, rack> tuples on (":"), so it probably
> parsed the IPLA3 entry as "DCLA" , ":RAC1" (which is different than the
> others on "RAC1"), and so the NTS did its thing distributing evenly between
> racks, and IPLA3 got more of the data and IPLA2 got less.
>
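A minimal sketch of the effect described above, under stated assumptions. parse_entry mirrors the guess that the doubled ":" in the IPLA3 entry leaves it with a rack name other than "RAC1"; the actual 0.7.8 parser is not shown in this thread and may split differently. replicas_for is a simplified model of the two-pass placement (unique racks first, then fill from any rack), not the real NetworkTopologyStrategy code. All function names are hypothetical, and the three LA ranges are treated as equal in size.

```python
from collections import Counter

def parse_entry(value):
    # Assumption: split only on the first ":" so "DCLA::RAC1" -> ("DCLA", ":RAC1").
    # This mirrors the guess in the message; the real parser may behave differently.
    dc, rack = value.split(":", 1)
    return dc, rack

def replicas_for(primary_index, nodes, racks, rf):
    """Simplified two-pass pick: unique racks first, then fill from any rack."""
    n = len(nodes)
    # Walk the DC-local ring clockwise, starting at the range's primary node.
    walk = [nodes[(primary_index + i) % n] for i in range(n)]
    chosen, seen_racks = [], set()
    # First pass: accept a node only if its rack has not been used yet.
    for node in walk:
        if len(chosen) >= rf:
            break
        if racks[node] not in seen_racks:
            chosen.append(node)
            seen_racks.add(racks[node])
    # Second pass: if still short of rf, take remaining nodes in ring order.
    for node in walk:
        if len(chosen) >= rf:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen

def ranges_held(topology, rf=2):
    """Count how many of the (equal-sized) ranges each node stores."""
    nodes = list(topology)  # dict order is taken as ring order
    racks = {node: parse_entry(value)[1] for node, value in topology.items()}
    counts = Counter()
    for i in range(len(nodes)):
        counts.update(replicas_for(i, nodes, racks, rf))
    return [counts[node] for node in nodes]

broken = {"IPLA1": "DCLA:RAC1", "IPLA2": "DCLA:RAC1", "IPLA3": "DCLA::RAC1"}
fixed = {"IPLA1": "DCLA:RAC1", "IPLA2": "DCLA:RAC1", "IPLA3": "DCLA:RAC1"}

print(ranges_held(broken))  # [2, 1, 3]
print(ranges_held(fixed))   # [2, 2, 2]
```

With the odd rack name, IPLA3 is picked for every range and IPLA2 only for its own, which matches the 2:1:3 load ratio reported for IPLA1, IPLA2 and IPLA3.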
> I've fixed it, and the reads/s and writes/s immediately equalized. I'm now
> doing a round of repairs/compactions/cleanups to equalize the data load as
> well.
>
> Unfortunately, it's not easy in Cassandra 0.7.8 to actually see the parsed
> topology state (unlike 0.8's nice ring output which shows the DC and rack),
> so I'm ashamed to say it took much longer than it should've to troubleshoot.
>
> Thanks for your help.
>
>
> On 2011-08-10, at 5:12 AM, aaron morton wrote:
>
>> WRT the load imbalance, checking the basics: have you run cleanup after any
>> token moves? Is repair running? Also, sometimes nodes get a bit bloated
>> from repair and will settle down with compaction.
>>
>> Your slightly odd tokens in the MTL DC are making it a little tricky to
>> understand what's going on. But I'm trying to check if you've followed the
>> multi DC token selection here:
>> http://wiki.apache.org/cassandra/Operations#Token_selection . Background
>> about what can happen in a multi DC deployment if the tokens are not right:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html
>>
>> This is what you currently have...
>>
>> DC: LA
>> IPLA1   Up  Normal  34.57 GB  11.11%  0
>> IPLA2   Up  Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
>> IPLA3   Up  Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
>>
>> DC: MTL
>> IPMTL1  Up  Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
>> IPMTL2  Up  Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
>> IPMTL3  Up  Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
>>
>> Using the bump approach you would have
>>
>> IPLA1 0
>> IPLA2 56713727820156410577229101238628035242
>> IPLA3 113427455640312821154458202477256070484
>>
>> IPMTL1 1
>> IPMTL2 56713727820156410577229101238628035243
>> IPMTL3 113427455640312821154458202477256070485
>>
>> Using the interleaving you would have
>>
>> IPLA1 0
>> IPMTL1 28356863910078205288614550619314017621
>> IPLA2 56713727820156410577229101238628035242
>> IPMTL2 85070591730234615865843651857942052863
>> IPLA3 113427455640312821154458202477256070484
>> IPMTL3 141784319550391026443072753096570088105
>>
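The two token layouts above can be reproduced with a few lines of arithmetic. This is a sketch that assumes the RandomPartitioner token space of 2**127, which is what the listed tokens imply; the helper name evenly_spaced is hypothetical.

```python
RING = 2 ** 127  # assumed RandomPartitioner token space

def evenly_spaced(n, offset=0):
    """Return n tokens spaced RING // n apart, each shifted by offset."""
    step = RING // n
    return [i * step + offset for i in range(n)]

# Bump approach: space each DC evenly on its own, then shift the second DC by 1.
la_bump = evenly_spaced(3)
mtl_bump = evenly_spaced(3, offset=1)

# Interleaving: space all six nodes evenly and alternate between the DCs.
interleaved = evenly_spaced(6)
la_interleaved = interleaved[0::2]
mtl_interleaved = interleaved[1::2]

for name, tokens in [("LA bump", la_bump), ("MTL bump", mtl_bump),
                     ("LA interleaved", la_interleaved),
                     ("MTL interleaved", mtl_interleaved)]:
    print(name, tokens)
```

Either way, each DC's own nodes end up evenly spaced around the ring, and no two nodes share a token.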
>> The current setup in LA gives each node in LA 33% of the LA local ring, which
>> should be right; just checking.
>>
>> If cleanup / repair / compaction is all good and you are confident the
>> tokens are right, try poking around with nodetool getendpoints to see which
>> nodes keys are sent to. Like you, I cannot see anything obvious in NTS that
>> would cause load to be imbalanced if they are all in the same rack.
>>
>> Cheers
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 10 Aug 2011, at 11:24, Mina Naguib wrote:
>>
>>> Hi everyone
>>>
>>> I'm observing a very peculiar type of imbalance and I'd appreciate any help
>>> or ideas to try. This is on cassandra 0.7.8.
>>>
>>> The original cluster was 3 machines in the DCMTL, equally balanced at
>>> 33.33% each and each holding roughly 34G.
>>>
>>> Then, I added to it 3 machines in the LA data center. The ring is
>>> currently as follows (IP addresses redacted for clarity):
>>>
>>> Address  Status  State   Load      Owns    Token
>>>                                            151236607520417094872610936636341427313
>>> IPLA1    Up      Normal  34.57 GB  11.11%  0
>>> IPMTL1   Up      Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
>>> IPLA2    Up      Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
>>> IPMTL2   Up      Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
>>> IPLA3    Up      Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
>>> IPMTL3   Up      Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
>>>
>>> The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines
>>> in yet another data center, but they're not ready yet to join the cluster.
>>> Once that third DC joins all nodes will be at 11.11%. However, I don't
>>> think this is related.
>>>
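The Owns column in the ring output above follows from the gaps between adjacent tokens. A small sketch, assuming the RandomPartitioner token space of 2**127 and that each node owns the range from the previous token up to its own; the function name ownership is hypothetical.

```python
RING = 2 ** 127  # assumed RandomPartitioner token space

ring = [
    ("IPLA1", 0),
    ("IPMTL1", 37809151880104273718152734159085356828),
    ("IPLA2", 56713727820156410577229101238628035242),
    ("IPMTL2", 94522879700260684295381835397713392071),
    ("IPLA3", 113427455640312821154458202477256070485),
    ("IPMTL3", 151236607520417094872610936636341427313),
]

def ownership(ring):
    """Percentage of the whole ring owned by each node."""
    ordered = sorted(ring, key=lambda entry: entry[1])
    owns = {}
    for i, (node, token) in enumerate(ordered):
        prev_token = ordered[i - 1][1]  # index -1 wraps to the last node
        owns[node] = ((token - prev_token) % RING) / RING * 100
    return owns

for node, pct in ownership(ring).items():
    print(f"{node:7s} {pct:.2f}%")
```

Each MTL node sits two ninths of the ring after the preceding LA token, which is where the 22.22% comes from.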
>>> The problem I'm currently observing is visible in the LA machines,
>>> specifically IPLA2 and IPLA3. IPLA2 has 50% of the expected volume, and
>>> IPLA3 has 150% of the expected volume.
>>>
>>> Putting their load side by side shows the peculiar ratio of 2:1:3 between
>>> the 3 LA nodes:
>>> 34.57 17.55 51.37
>>> (the same 2:1:3 ratio is reflected in our internal tools trending
>>> reads/second and writes/second)
>>>
>>> I've tried several iterations of compactions/cleanups to no avail. In
>>> terms of config this is the main keyspace:
>>> Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>>> Options: [DCMTL:2, DCLA:2]
>>> And this is the cassandra-topology.properties file (IPs again redacted for
>>> clarity):
>>> IPMTL1:DCMTL:RAC1
>>> IPMTL2:DCMTL:RAC1
>>> IPMTL3:DCMTL:RAC1
>>> IPLA1:DCLA:RAC1
>>> IPLA2:DCLA:RAC1
>>> IPLA3:DCLA::RAC1
>>> IPLON1:DCLON:RAC1
>>> IPLON2:DCLON:RAC1
>>> IPLON3:DCLON:RAC1
>>> # default for unknown nodes
>>> default=DCBAD:RACBAD
>>>
>>>
>>> One thing that did occur to me while reading the source code for the
>>> NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers
>>> placing data on different racks. Since all my machines are defined as being
>>> in the same rack, I believe that the 2-pass approach would still yield
>>> balanced placement.
>>>
>>> However, just to test, I modified the topology file live to specify that
>>> IPLA1, IPLA2 and IPLA3 are in 3 different racks, and sure enough I
>>> immediately saw the reads/second and writes/second equalize to the expected
>>> fair volume (I quickly reverted that change).
>>>
>>> So, it seems somehow related to rack awareness, but I've been racking my
>>> brain and I can't figure out how/why, or why the three MTL machines are not
>>> affected the same way.
>>>
>>> If the solution is to specify them in different racks and run repair on
>>> everything, I'm okay with that - but I hate doing that without first
>>> understanding *why* the current behavior is the way it is.
>>>
>>> Any ideas would be hugely appreciated.
>>>
>>> Thank you.
>>>
>>
>