WRT the load imbalance, checking the basics first: have you run cleanup after any token moves? Is repair running regularly? Also, nodes sometimes get a bit bloated from repair and will settle down after compaction.
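If you want to re-run that maintenance pass, something along these lines is roughly what I mean (a sketch only - check nodetool help for the exact flags on 0.7, and I'm assuming the default JMX port 8080):

nodetool -h <node_ip> -p 8080 cleanup
nodetool -h <node_ip> -p 8080 repair
nodetool -h <node_ip> -p 8080 compact

Run against each node in turn: cleanup throws away data the node no longer owns after a token move, repair makes sure it holds the replicas it should, and a major compaction merges away the temporary bloat repair can leave behind.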
Your slightly odd tokens in the MTL DC make it a little tricky to understand what's going on. I want to check whether you've followed the multi-DC token selection guidance here:
http://wiki.apache.org/cassandra/Operations#Token_selection

Background on what can happen in a multi-DC deployment if the tokens are not right:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html

This is what you currently have…

DC: LA
IPLA1   Up  Normal  34.57 GB  11.11%  0
IPLA2   Up  Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
IPLA3   Up  Normal  51.37 GB  11.11%  113427455640312821154458202477256070485

DC: MTL
IPMTL1  Up  Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
IPMTL2  Up  Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
IPMTL3  Up  Normal  34.71 GB  22.22%  151236607520417094872610936636341427313

Using the bump approach you would have:

IPLA1   0
IPLA2   56713727820156410577229101238628035242
IPLA3   113427455640312821154458202477256070484
IPMTL1  1
IPMTL2  56713727820156410577229101238628035243
IPMTL3  113427455640312821154458202477256070485

Using the interleaving approach you would have:

IPLA1   0
IPMTL1  28356863910078205288614550619314017621
IPLA2   56713727820156410577229101238628035242
IPMTL2  85070591730234615865843651857942052863
IPLA3   113427455640312821154458202477256070484
IPMTL3  141784319550391026443072753096570088105

(There is a rough sketch of how to generate these layouts at the very bottom of this mail.)

The current setup gives each node in LA 33% of the LA-local ring, which should be right; just checking.

If cleanup / repair / compaction is all good and you are confident the tokens are right, try poking around with nodetool getendpoints to see which nodes keys are sent to.

Like you, I cannot see anything obvious in NTS that would cause load to be imbalanced if they are all in the same rack.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Aug 2011, at 11:24, Mina Naguib wrote:

> Hi everyone
>
> I'm observing a very peculiar type of imbalance and I'd appreciate any help
> or ideas to try. This is on cassandra 0.7.8.
>
> The original cluster was 3 machines in the DCMTL, equally balanced at 33.33%
> each and each holding roughly 34G.
>
> Then, I added to it 3 machines in the LA data center. The ring is currently
> as follows (IP addresses redacted for clarity):
>
> Address  Status  State   Load      Owns    Token
>                                            151236607520417094872610936636341427313
> IPLA1    Up      Normal  34.57 GB  11.11%  0
> IPMTL1   Up      Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
> IPLA2    Up      Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
> IPMTL2   Up      Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
> IPLA3    Up      Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
> IPMTL3   Up      Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
>
> The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines in
> yet another data center, but they're not ready yet to join the cluster. Once
> that third DC joins all nodes will be at 11.11%. However, I don't think this
> is related.
>
> The problem I'm currently observing is visible in the LA machines,
> specifically IPLA2 and IPLA3. IPLA2 has 50% the expected volume, and IPLA3
> has 150% the expected volume.
>
> Putting their load side by side shows the peculiar ratio of 2:1:3 between the
> 3 LA nodes:
> 34.57  17.55  51.37
> (the same 2:1:3 ratio is reflected in our internal tools trending
> reads/second and writes/second)
>
> I've tried several iterations of compactions/cleanups to no avail. In terms
> of config this is the main keyspace:
> Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
> Options: [DCMTL:2, DCLA:2]
> And this is the cassandra-topology.properties file (IPs again redacted for
> clarity):
> IPMTL1:DCMTL:RAC1
> IPMTL2:DCMTL:RAC1
> IPMTL3:DCMTL:RAC1
> IPLA1:DCLA:RAC1
> IPLA2:DCLA:RAC1
> IPLA3:DCLA::RAC1
> IPLON1:DCLON:RAC1
> IPLON2:DCLON:RAC1
> IPLON3:DCLON:RAC1
> # default for unknown nodes
> default=DCBAD:RACBAD
>
> One thing that did occur to me while reading the source code for the
> NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers
> placing data on different racks. Since all my machines are defined as in the
> same rack, I believe that the 2-pass approach would still yield balanced
> placement.
>
> However, just to test, I modified live the topology file to specify that
> IPLA1, IPLA2 and IPLA3 are in 3 different racks, and sure enough I saw
> immediately that the reads/second and writes/second equalized to expected
> fair volume (I quickly reverted that change).
>
> So, it seems somehow related to rack awareness, but I've been raking my head
> and I can't figure out how/why, or why the three MTL machines are not
> affected the same way.
>
> If the solution is to specify them in different racks and run repair on
> everything, I'm okay with that - but I hate doing that without first
> understanding *why* the current behavior is the way it is.
>
> Any ideas would be hugely appreciated.
>
> Thank you.
>
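PS - in case it helps, here is the rough sketch of the bump and interleave layouts I mentioned above. It is just plain Python (nothing official), the DC names and node counts are placeholders, and the exact values may differ by one or two from the numbers I pasted earlier depending on rounding; the even spacing is what matters.

RING = 2 ** 127  # RandomPartitioner token space

def evenly_spaced(n):
    # n tokens spread evenly around the ring
    return [i * RING // n for i in range(n)]

def bump(dc_sizes):
    # Bump: every DC uses the same evenly spaced layout, offset by its
    # DC index so no two nodes end up with exactly the same token.
    # Assumes dicts keep insertion order (Python 3.7+).
    return {dc: [t + offset for t in evenly_spaced(n)]
            for offset, (dc, n) in enumerate(dc_sizes.items())}

def interleave(dc_sizes):
    # Interleave: spread all nodes evenly around the ring and hand the
    # tokens out to the DCs in round-robin order.
    order = []
    remaining = dict(dc_sizes)
    while any(remaining.values()):
        for dc in dc_sizes:
            if remaining[dc] > 0:
                order.append(dc)
                remaining[dc] -= 1
    layout = {dc: [] for dc in dc_sizes}
    for dc, token in zip(order, evenly_spaced(len(order))):
        layout[dc].append(token)
    return layout

print(bump({"DCLA": 3, "DCMTL": 3}))
print(interleave({"DCLA": 3, "DCMTL": 3}))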