Hi everyone

I'm observing a very peculiar type of imbalance and I'd appreciate any help or 
ideas to try.  This is on cassandra 0.7.8.

The original cluster was 3 machines in the DCMTL, equally balanced at 33.33% 
each and each holding roughly 34G.

Then, I added to it 3 machines in the LA data center.  The ring is currently as 
follows (IP addresses redacted for clarity):

Address         Status State   Load            Owns    Token                    
                   
                                                       
151236607520417094872610936636341427313     
IPLA1           Up     Normal  34.57 GB        11.11%  0                        
                   
IPMTL1          Up     Normal  34.43 GB        22.22%  
37809151880104273718152734159085356828      
IPLA2           Up     Normal  17.55 GB        11.11%  
56713727820156410577229101238628035242      
IPMTL2          Up     Normal  34.56 GB        22.22%  
94522879700260684295381835397713392071      
IPLA3           Up     Normal  51.37 GB        11.11%  
113427455640312821154458202477256070485     
IPMTL3          Up     Normal  34.71 GB        22.22%  
151236607520417094872610936636341427313     

The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines in 
yet another data center, but they're not ready yet to join the cluster.  Once 
that third DC joins all nodes will be at 11.11%. However, I don't think this is 
related.

The problem I'm currently observing is visible in the LA machines, specifically 
IPLA2 and IPLA3.  IPLA2 has 50% the expected volume, and IPLA3 has 150% the 
expected volume.

Putting their load side by side shows the peculiar ratio of 2:1:3 between the 3 
LA nodes:
34.57 17.55 51.37
(the same 2:1:3 ratio is reflected in our internal tools trending reads/second 
and writes/second)

I've tried several iterations of compactions/cleanups to no avail.  In terms of 
config this is the main keyspace:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
    Options: [DCMTL:2, DCLA:2]
And this is the cassandra-topology.properties file (IPs again redacted for 
clarity):
  IPMTL1:DCMTL:RAC1
  IPMTL2:DCMTL:RAC1
  IPMTL3:DCMTL:RAC1
  IPLA1:DCLA:RAC1
  IPLA2:DCLA:RAC1
  IPLA3:DCLA::RAC1
  IPLON1:DCLON:RAC1
  IPLON2:DCLON:RAC1
  IPLON3:DCLON:RAC1
  # default for unknown nodes
  default=DCBAD:RACBAD


One thing that did occur to me while reading the source code for the 
NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers placing 
data on different racks.  Since all my machines are defined as in the same 
rack, I believe that the 2-pass approach would still yield balanced placement.

However, just to test, I modified live the topology file to specify that IPLA1, 
IPLA2 and IPLA3 are in 3 different racks, and sure enough I saw immediately 
that the reads/second and writes/second equalized to expected fair volume (I 
quickly reverted that change).

So, it seems somehow related to rack awareness, but I've been raking my head 
and I can't figure out how/why, or why the three MTL machines are not affected 
the same way.

If the solution is to specify them in different racks and run repair on 
everything, I'm okay with that - but I hate doing that without first 
understanding *why* the current behavior is the way it is.

Any ideas would be hugely appreciated.

Thank you.

Reply via email to