Cool.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 11 Aug 2011, at 02:45, Mina Naguib wrote:

> Hi Aaron
>
> Thank you very much for the reply and the pointers to the previous list
> discussions. The second one was particularly telling.
>
> I'm happy to say that the problem is fixed, and it's so trivial it's quite
> embarrassing - but I'll state it here for the sake of the archives.
>
> There was an extra colon in the topology file in the line defining IPLA3.
> It's just as visible in my prod config as it is in my example below ;-)
>
> I'm guessing the parser splits <dc, rack> tuples on (":"), so it probably
> parsed the IPLA3 entry as "DCLA", ":RAC1" (which is different from the
> others on "RAC1"), and so the NTS did its thing distributing evenly between
> racks, and IPLA3 got more of the data and IPLA2 got less.
>
> I've fixed it, and the reads/s and writes/s immediately equalized. I'm now
> doing a round of repairs/compactions/cleanups to equalize the data load as
> well.
>
> Unfortunately it's not easy in Cassandra 0.7.8 to actually see the parsed
> topology state (unlike 0.8's nice ring output, which shows the DC and rack),
> so I'm ashamed to say it took much longer than it should've to troubleshoot.
>
> Thanks for your help.
>
>
> On 2011-08-10, at 5:12 AM, aaron morton wrote:
>
>> WRT the load imbalance, checking the basics: you've run cleanup after any
>> token moves? Repair is running? Also, sometimes nodes get a bit bloated
>> from repair and will settle down with compaction.
>>
>> Your slightly odd tokens in the MTL DC are making it a little tricky to
>> understand what's going on. But I'm trying to check if you've followed the
>> multi-DC token selection here
>> http://wiki.apache.org/cassandra/Operations#Token_selection . Background
>> about what can happen in a multi-DC deployment if the tokens are not right:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html
>>
>> This is what you currently have...
>>
>> DC: LA
>> IPLA1   Up  Normal  34.57 GB  11.11%  0
>> IPLA2   Up  Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
>> IPLA3   Up  Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
>>
>> DC: MTL
>> IPMTL1  Up  Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
>> IPMTL2  Up  Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
>> IPMTL3  Up  Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
>>
>> Using the bump approach you would have:
>>
>> IPLA1   0
>> IPLA2   56713727820156410577229101238628035242
>> IPLA3   113427455640312821154458202477256070484
>>
>> IPMTL1  1
>> IPMTL2  56713727820156410577229101238628035243
>> IPMTL3  113427455640312821154458202477256070485
>>
>> Using the interleaving you would have:
>>
>> IPLA1   0
>> IPMTL1  28356863910078205288614550619314017621
>> IPLA2   56713727820156410577229101238628035242
>> IPMTL2  85070591730234615865843651857942052863
>> IPLA3   113427455640312821154458202477256070484
>> IPMTL3  141784319550391026443072753096570088105
>>
>> The current setup in LA gives each node in LA 33% of the LA-local ring,
>> which should be right, just checking.
>>
>> If cleanup / repair / compaction is all good and you are confident the
>> tokens are right, try poking around with nodetool getendpoints to see which
>> nodes keys are sent to. Like you, I cannot see anything obvious in NTS that
>> would cause load to be imbalanced if they are all in the same rack.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
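For reference, a minimal sketch of the token arithmetic behind the two layouts
described above. It assumes the standard RandomPartitioner ring of 2**127 tokens
and 3 nodes per DC; the helper names are made up for illustration, and depending
on rounding the printed values can differ by a token or two from the ones quoted
in the thread:

    # Python sketch of the "bump" and "interleaving" multi-DC token layouts.
    RING = 2 ** 127  # RandomPartitioner token space

    def bumped_tokens(nodes_per_dc, dc_offset):
        # Each DC gets its own evenly spaced sub-ring, shifted by a small
        # per-DC offset so no two nodes share the exact same token.
        return [i * RING // nodes_per_dc + dc_offset for i in range(nodes_per_dc)]

    def interleaved_tokens(dc_index, num_dcs, nodes_per_dc):
        # All nodes share one evenly spaced ring, and the DCs alternate around it.
        total = num_dcs * nodes_per_dc
        return [i * RING // total for i in range(dc_index, total, num_dcs)]

    print("LA  bump:       ", bumped_tokens(3, 0))
    print("MTL bump:       ", bumped_tokens(3, 1))
    print("LA  interleaved:", interleaved_tokens(0, 2, 3))
    print("MTL interleaved:", interleaved_tokens(1, 2, 3))

Either way, each DC ends up with an evenly spaced sub-ring of its own, which is
what keeps ownership balanced within each DC.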
>> On 10 Aug 2011, at 11:24, Mina Naguib wrote:
>>
>>> Hi everyone
>>>
>>> I'm observing a very peculiar type of imbalance and I'd appreciate any help
>>> or ideas to try. This is on Cassandra 0.7.8.
>>>
>>> The original cluster was 3 machines in DCMTL, equally balanced at 33.33%
>>> each and each holding roughly 34G.
>>>
>>> Then I added to it 3 machines in the LA data center. The ring is currently
>>> as follows (IP addresses redacted for clarity):
>>>
>>> Address  Status  State   Load      Owns    Token
>>>                                            151236607520417094872610936636341427313
>>> IPLA1    Up      Normal  34.57 GB  11.11%  0
>>> IPMTL1   Up      Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
>>> IPLA2    Up      Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
>>> IPMTL2   Up      Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
>>> IPLA3    Up      Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
>>> IPMTL3   Up      Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
>>>
>>> The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines
>>> in yet another data center, but they're not ready yet to join the cluster.
>>> Once that third DC joins, all nodes will be at 11.11%. However, I don't
>>> think this is related.
>>>
>>> The problem I'm currently observing is visible in the LA machines,
>>> specifically IPLA2 and IPLA3. IPLA2 has 50% of the expected volume, and
>>> IPLA3 has 150% of the expected volume.
>>>
>>> Putting their load side by side shows the peculiar ratio of 2:1:3 between
>>> the 3 LA nodes:
>>>
>>> 34.57    17.55    51.37
>>>
>>> (the same 2:1:3 ratio is reflected in our internal tools trending
>>> reads/second and writes/second)
>>>
>>> I've tried several iterations of compactions/cleanups to no avail. In
>>> terms of config, this is the main keyspace:
>>>
>>> Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>>> Options: [DCMTL:2, DCLA:2]
>>>
>>> And this is the cassandra-topology.properties file (IPs again redacted for
>>> clarity):
>>>
>>> IPMTL1:DCMTL:RAC1
>>> IPMTL2:DCMTL:RAC1
>>> IPMTL3:DCMTL:RAC1
>>> IPLA1:DCLA:RAC1
>>> IPLA2:DCLA:RAC1
>>> IPLA3:DCLA::RAC1
>>> IPLON1:DCLON:RAC1
>>> IPLON2:DCLON:RAC1
>>> IPLON3:DCLON:RAC1
>>> # default for unknown nodes
>>> default=DCBAD:RACBAD
>>>
>>> One thing that did occur to me while reading the source code for
>>> NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers
>>> placing data on different racks. Since all my machines are defined as being
>>> in the same rack, I believe the 2-pass approach should still yield balanced
>>> placement.
>>>
>>> However, just to test, I modified the topology file live to specify that
>>> IPLA1, IPLA2 and IPLA3 are in 3 different racks, and sure enough I saw
>>> immediately that the reads/second and writes/second equalized to the
>>> expected fair volume (I quickly reverted that change).
>>>
>>> So it seems somehow related to rack awareness, but I've been racking my
>>> brain and I can't figure out how/why, or why the three MTL machines are not
>>> affected the same way.
>>>
>>> If the solution is to specify them in different racks and run repair on
>>> everything, I'm okay with that - but I hate doing that without first
>>> understanding *why* the current behavior is the way it is.
>>>
>>> Any ideas would be hugely appreciated.
>>>
>>> Thank you.
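As a footnote for the archives, the stray colon Mina found can be reproduced
with a toy parser. This is only an illustration of the splitting behaviour Mina
guessed at above, not Cassandra's actual snitch code, and the NODE:DC:RACK
layout simply mirrors the redacted listing in the thread:

    # Toy illustration: how one extra ':' silently changes the parsed rack.
    def parse_topology_line(line):
        # Assume each entry maps a node to "DC:RACK" and the parser splits on ':'.
        node, dc_rack = line.split(":", 1)   # illustrative format: NODE:DC:RACK
        dc, rack = dc_rack.split(":", 1)
        return node, dc, rack

    for line in ["IPLA1:DCLA:RAC1", "IPLA2:DCLA:RAC1", "IPLA3:DCLA::RAC1"]:
        node, dc, rack = parse_topology_line(line)
        print(node, "-> dc", repr(dc), "rack", repr(rack))

    # IPLA1 -> dc 'DCLA' rack 'RAC1'
    # IPLA2 -> dc 'DCLA' rack 'RAC1'
    # IPLA3 -> dc 'DCLA' rack ':RAC1'   <- effectively a different rack

If IPLA3 really is seen as a rack of its own, the rack-preferring pass in
NetworkTopologyStrategy also explains the 2:1:3 ratio: with DCLA:2 and ring
order IPLA1, IPLA2, IPLA3, keys whose first LA replica is IPLA1 or IPLA2 skip
their same-rack neighbour and take IPLA3 as the second replica, while keys
starting at IPLA3 fall through to IPLA1. That gives IPLA1, IPLA2 and IPLA3
roughly 2/3, 1/3 and all of the DC's data respectively - the 2:1:3 load
observed.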