We're currently on 0.6.0 and are waiting for the full 0.7 release before we upgrade. We have other Thrift/PHP code to update whenever we upgrade Cassandra, so we don't want to put a release candidate on our production system.

We *did* have a problem with a column family that had only a few rows (probably hundreds?), some of them exceeding 100 MB in size, so we migrated that column family's data into a new column family, spreading the old data across what is now hundreds of thousands of rows. Our largest row is currently in the ballpark of a few hundred kilobytes.
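In case it helps to see the shape of it, the split was essentially column bucketing: each oversized row's columns were spread across many small rows keyed off the original row key. A rough Python sketch of the idea (the key format, hashing, and bucket count here are illustrative, not our exact scheme):

    import zlib

    # Illustrative only: shard one huge row into many small rows by bucketing columns.
    BUCKETS = 30000  # roughly the fan-out we ended up with per original row

    def bucketed_row_key(original_key, column_name):
        # Route each column to one of BUCKETS small rows derived from the big row's key.
        bucket = (zlib.crc32(column_name) & 0xffffffff) % BUCKETS
        return '%s:%05d' % (original_key, bucket)

    # e.g. everything that used to live in row 'user123' is now spread across
    # rows 'user123:00000' through 'user123:29999'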

Is there any way to determine via "nodetool cfstats" (or similar) how many rows we have per column family, to help answer your second question a little better? I do know from our migration that we created something like 30,000 smaller rows for each of the big rows, and then we removed the old column family from our nodes (first by dropping it from the XML configuration, then by deleting its files at the OS level). When the migration finished, we *still* saw the large imbalance, which is what prompted my questions and led to using "nodetool move" to reset our token values. Even after running cleanup, flush, and repair on each node individually, we're still left with this imbalanced load.
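If cfstats can't report that directly, I assume we could count keys ourselves by paging over the column family with get_range_slices through Thrift. A rough sketch of what I have in mind, written against what I believe are the 0.6 Python bindings generated from cassandra.thrift (host, port, keyspace, and column family names are placeholders for ours):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                                  KeyRange, ConsistencyLevel)

    # Assumes the default buffered (non-framed) transport on the Thrift port.
    socket = TSocket.TSocket('localhost', 9160)
    transport = TTransport.TBufferedTransport(socket)
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    parent = ColumnParent(column_family='MyColumnFamily')
    # We only care about keys, so fetch at most one column per row.
    predicate = SlicePredicate(slice_range=SliceRange(start='', finish='', count=1))

    count = 0
    start_key = ''
    while True:
        key_range = KeyRange(start_key=start_key, end_key='', count=1000)
        page = client.get_range_slices('MyKeyspace', parent, predicate,
                                       key_range, ConsistencyLevel.ONE)
        for key_slice in page:
            # The first key of each page after the first repeats the previous
            # last key, so skip it; note that deleted rows can still show up
            # as "range ghosts" with no columns, which inflates the count.
            if key_slice.key != start_key:
                count += 1
        if len(page) < 1000:
            break
        start_key = page[-1].key

    transport.close()
    print('approximate row count: %d' % count)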

Thanks for your help. Let me know if there's any additional information I can give.


On 01/06/2011 10:39 AM, Peter Schuller wrote:
I posted row sizes (min/max/mean) of our largest data set in my original
message, but had zero responses on the mailing list. The folks in IRC told
me to wait it out and see if it rebalanced on its own (it didn't), or to run a
repair on each node one at a time (didn't help), and that it wasn't a big
concern until we had "dozens of GBs" worth of data.
Ok. It may not be a practical concern right now, but an unexplained
imbalance is not good. First off, is this the very latest 0.6 release
(or one of the 0.7 RCs), or is it an old 0.6? I don't remember offhand
whether any bugs fixed in the 0.6 series would explain this particular
behavior, but checking that you're on the latest version is probably a
good place to start.

Also, you mentioned originally that "Our row min/max/mean values are
mostly the same". I'm not entirely sure what you were referring to;
the important points I wanted to ask about are:

(1) Do you have "many" keys (say, thousands or more) so that there
should be no statistically significant imbalance between the nodes in
terms of the *number* of rows?

(2) How sure are you about the distribution of row sizes; is it
possible you have a small number of very large rows that are screwing
up the statistics?
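To make (2) concrete with completely made-up numbers: a hundred thousand 10 KB rows is about 1 GB, and just ten 100 MB rows add roughly another 1 GB on top of that, so ten rows would hold about half the data while only nudging the mean row size from 10 KB to about 20 KB. If those few rows happen to land in one node's token range, that node's load will look very different from the others even though the summary statistics look unremarkable.

    # Made-up numbers, just to show how little a few huge rows move the mean.
    small_rows, small_size = 100000, 10 * 1024        # 100,000 rows of 10 KB
    large_rows, large_size = 10, 100 * 1024 * 1024    # 10 rows of 100 MB
    total_bytes = small_rows * small_size + large_rows * large_size
    print('share held by the big rows: %.0f%%'
          % (100.0 * large_rows * large_size / total_bytes))
    print('mean row size: %.1f KB'
          % (total_bytes / float(small_rows + large_rows) / 1024))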
