> I posted row sizes (min/max/mean) of our largest data set in my original
> message, but had zero responses on the mailing list. The folks in IRC told
> me to wait it out, see if it rebalanced on its own (it didn't), or to run a
> repair on each node one at a time (didn't help), and that it wasn't a big
> concern until we had "dozens of GBs" worth of data.

Ok. It may not be a practical concern right now, but an unexplained imbalance is not good.

First off, is this the very latest 0.6 release, one of the 0.7 RCs, or an older 0.6? I don't remember offhand whether any bugs fixed in the 0.6 series would explain this particular behavior, but checking that you are on the latest version is a good place to start.

Also, you mentioned originally that "Our row min/max/mean values are mostly the same". I'm not entirely sure what you are referring to; the important points I wanted to ask about are:

(1) Do you have "many" keys (say, thousands or more), so that there should be no statistically significant imbalance between the nodes in terms of the *number* of rows?

(2) How sure are you about the distribution of row sizes; is it possible you have a small number of very large rows that are skewing the statistics?

--
/ Peter Schuller
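
To illustrate the concern behind question (2), here is a minimal Python sketch. It assumes keys are placed by hashing them onto a fixed number of nodes, which is a simplification for illustration and not Cassandra's actual partitioner code. The function names, key names, and row sizes are all made up. It shows that row counts can be well balanced while a few very large rows still skew the data size per node.

```python
import hashlib
from collections import defaultdict

def node_for(key: str, num_nodes: int) -> int:
    """Hypothetical placement: hash the key, then take it modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def load_per_node(row_sizes: dict, num_nodes: int):
    """Return (rows per node, bytes per node) for a key -> size-in-bytes map."""
    rows = defaultdict(int)
    size = defaultdict(int)
    for key, nbytes in row_sizes.items():
        node = node_for(key, num_nodes)
        rows[node] += 1
        size[node] += nbytes
    return dict(rows), dict(size)

# Made-up data: 10,000 small rows of 1 KB each, plus 3 very large rows of 500 MB each.
row_sizes = {f"key{i}": 1024 for i in range(10_000)}
row_sizes.update({f"big{i}": 500 * 1024**2 for i in range(3)})

rows, size = load_per_node(row_sizes, num_nodes=4)
for node in sorted(rows):
    print(f"node {node}: {rows[node]} rows, {size[node] / 1024**2:.1f} MB")
```

With only three large rows spread over four nodes, at least one node holds none of them, so the per-node sizes differ by hundreds of MB even though each node holds roughly a quarter of the rows.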