Hi everyone
I'm observing a very peculiar kind of load imbalance and I'd appreciate any help
or ideas to try. This is on Cassandra 0.7.8.
The original cluster was 3 machines in the DCMTL data center, equally balanced at
33.33% ownership each, with each node holding roughly 34 GB.
Then I added 3 machines in the LA data center (DCLA). The ring is currently as
follows (IP addresses redacted for clarity):
Address  Status  State   Load      Owns     Token
                                            151236607520417094872610936636341427313
IPLA1    Up      Normal  34.57 GB  11.11%   0
IPMTL1   Up      Normal  34.43 GB  22.22%   37809151880104273718152734159085356828
IPLA2    Up      Normal  17.55 GB  11.11%   56713727820156410577229101238628035242
IPMTL2   Up      Normal  34.56 GB  22.22%   94522879700260684295381835397713392071
IPLA3    Up      Normal  51.37 GB  11.11%   113427455640312821154458202477256070485
IPMTL3   Up      Normal  34.71 GB  22.22%   151236607520417094872610936636341427313
The bump to 22.22% on the 3 MTL nodes is deliberate, in anticipation of 3 more
machines in yet another data center (DCLON) that aren't ready to join the cluster
yet; their tokens will land inside the ranges the MTL nodes currently cover, so
once that third DC joins, all nodes will be at 11.11%. However, I don't think
this is related.
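As a sanity check, the Owns column falls straight out of the token spacing.
Quick throwaway Java (the class and names are mine; I'm also assuming
RandomPartitioner and its 0..2^127 token space, which is what those token values
suggest):

import java.math.BigInteger;

// Quick check of the Owns column: a node's ownership is simply the gap
// between its token and the previous token on the ring.
// Assumption (not stated above): RandomPartitioner, i.e. a 0..2^127 token space.
public class OwnsCheck {
    public static void main(String[] args) {
        String[] names = { "IPLA1", "IPMTL1", "IPLA2", "IPMTL2", "IPLA3", "IPMTL3" };
        BigInteger[] tokens = {
            new BigInteger("0"),
            new BigInteger("37809151880104273718152734159085356828"),
            new BigInteger("56713727820156410577229101238628035242"),
            new BigInteger("94522879700260684295381835397713392071"),
            new BigInteger("113427455640312821154458202477256070485"),
            new BigInteger("151236607520417094872610936636341427313"),
        };
        BigInteger ring = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < tokens.length; i++) {
            BigInteger prev = tokens[(i + tokens.length - 1) % tokens.length];
            BigInteger gap = tokens[i].subtract(prev).mod(ring);   // wraps around for IPLA1
            System.out.printf("%-7s owns %.2f%%%n", names[i],
                    gap.doubleValue() * 100 / ring.doubleValue());
        }
    }
}

That prints 11.11% for the LA nodes and 22.22% for the MTL nodes, matching
nodetool, so the token assignment itself looks right.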
The problem I'm currently observing is on the LA machines, specifically IPLA2 and
IPLA3: IPLA2 has about 50% of the expected volume and IPLA3 about 150% (the
expectation being that each LA node ends up holding roughly the same ~34 GB that
the MTL nodes hold).
Putting their loads side by side shows the peculiar 2:1:3 ratio between the 3 LA
nodes:
34.57 GB   17.55 GB   51.37 GB   (34.57/17.55 ≈ 2.0, 51.37/17.55 ≈ 2.9)
The same 2:1:3 ratio shows up in our internal tools that trend reads/second and
writes/second.
I've tried several rounds of compaction and cleanup to no avail. In terms of
configuration, this is the main keyspace:
Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Options: [DCMTL:2, DCLA:2]
And this is the cassandra-topology.properties file (IPs again redacted for
clarity):
IPMTL1:DCMTL:RAC1
IPMTL2:DCMTL:RAC1
IPMTL3:DCMTL:RAC1
IPLA1:DCLA:RAC1
IPLA2:DCLA:RAC1
IPLA3:DCLA::RAC1
IPLON1:DCLON:RAC1
IPLON2:DCLON:RAC1
IPLON3:DCLON:RAC1
# default for unknown nodes
default=DCBAD:RACBAD
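For completeness, this is the file read by the PropertyFileSnitch, which is what
the snitch is set to in cassandra.yaml on all nodes:

endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch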
One thing that did occur to me while reading the source of
NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers placing
replicas on different racks. Since all my machines are defined as being in the
same rack, I believe the two-pass approach should still yield balanced placement.
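To make sure I'm reading it right, here is roughly how I understand the per-DC
selection (a stripped-down sketch with my own class and method names, not the
actual 0.7.8 code):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// My simplified reading of the per-DC, rack-preferring replica selection.
public class NtsSketch {

    static class Node {
        final BigInteger token;
        final String name, dc, rack;
        Node(String token, String name, String dc, String rack) {
            this.token = new BigInteger(token);
            this.name = name; this.dc = dc; this.rack = rack;
        }
    }

    // Walk the ring clockwise from the key's token, restricted to one DC:
    // pass 1 only takes nodes on racks not used yet, pass 2 fills the
    // remaining slots from already-used racks.
    static List<Node> replicasForDc(List<Node> ring, BigInteger keyToken, String dc, int rf) {
        List<Node> sorted = new ArrayList<>(ring);
        sorted.sort((a, b) -> a.token.compareTo(b.token));

        int start = 0;                               // first node with token >= keyToken
        while (start < sorted.size() && sorted.get(start).token.compareTo(keyToken) < 0) start++;

        List<Node> dcWalk = new ArrayList<>();       // DC-local nodes in ring order from 'start'
        for (int i = 0; i < sorted.size(); i++) {
            Node n = sorted.get((start + i) % sorted.size());
            if (n.dc.equals(dc)) dcWalk.add(n);
        }

        List<Node> replicas = new ArrayList<>();
        Set<String> racksUsed = new HashSet<>();
        for (Node n : dcWalk) {                      // pass 1: distinct racks only
            if (replicas.size() == rf) break;
            if (racksUsed.add(n.rack)) replicas.add(n);
        }
        for (Node n : dcWalk) {                      // pass 2: top up from already-used racks
            if (replicas.size() == rf) break;
            if (!replicas.contains(n)) replicas.add(n);
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<Node> ring = new ArrayList<>();
        ring.add(new Node("0", "IPLA1", "DCLA", "RAC1"));
        ring.add(new Node("37809151880104273718152734159085356828", "IPMTL1", "DCMTL", "RAC1"));
        ring.add(new Node("56713727820156410577229101238628035242", "IPLA2", "DCLA", "RAC1"));
        ring.add(new Node("94522879700260684295381835397713392071", "IPMTL2", "DCMTL", "RAC1"));
        ring.add(new Node("113427455640312821154458202477256070485", "IPLA3", "DCLA", "RAC1"));
        ring.add(new Node("151236607520417094872610936636341427313", "IPMTL3", "DCMTL", "RAC1"));

        // Sample evenly spaced tokens and print the DCLA replica pair for each.
        BigInteger step = BigInteger.valueOf(2).pow(127).divide(BigInteger.valueOf(18));
        for (int i = 0; i < 18; i++) {
            BigInteger keyToken = step.multiply(BigInteger.valueOf(i));
            List<Node> r = replicasForDc(ring, keyToken, "DCLA", 2);
            System.out.println(keyToken + " -> " + r.get(0).name + ", " + r.get(1).name);
        }
    }
}

With all six nodes declared on RAC1, pass 1 can only take the first LA node it
meets and pass 2 takes its clockwise LA neighbour, so each LA node ends up as a
replica for an equal share of the sampled ranges - the balanced placement I'd
expect, which is why I still don't see where a 2:1:3 split could come from.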
However, just to test, I modified the topology file live to put IPLA1, IPLA2 and
IPLA3 in 3 different racks, and sure enough the reads/second and writes/second
immediately equalized to the expected, fair volume (I quickly reverted that
change).
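Concretely, the temporary change was along these lines (the DC stays DCLA, only
the rack part changes; the exact rack names are just for illustration):

IPLA1:DCLA:RAC1
IPLA2:DCLA:RAC2
IPLA3:DCLA:RAC3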
So it seems to be somehow related to rack awareness, but I've been racking my
brain and I can't figure out how or why, or why the three MTL machines are not
affected in the same way.
If the solution is to specify them in different racks and run repair on
everything, I'm okay with that - but I hate doing that without first
understanding *why* the current behavior is the way it is.
Any ideas would be hugely appreciated.
Thank you.