I like to base my batch sizes off of the total number of columns instead of the number of rows. This effectively means counting the number of Mutation objects in your mutation map and submitting the batch once it reaches a certain size. For my data, batch sizes of about 25,000 columns work best. You'll need to adjust this up or down depending on the size of your column names / values and available memory.
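Something like this rough sketch, if it helps (not my actual code; the batch_mutate signature below is the 0.6-era Thrift one, and names like ColumnCountBatcher, COLUMN_THRESHOLD, flush(), and the choice of QUORUM are just illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.Mutation;

    public class ColumnCountBatcher {
        // Tune this for your column name/value sizes and available heap.
        private static final int COLUMN_THRESHOLD = 25000;

        private final Cassandra.Client client;
        private final String keyspace;

        // row key -> column family -> mutations
        private final Map<String, Map<String, List<Mutation>>> mutationMap = new HashMap<>();
        private int pendingColumns = 0;

        public ColumnCountBatcher(Cassandra.Client client, String keyspace) {
            this.client = client;
            this.keyspace = keyspace;
        }

        public void add(String rowKey, String columnFamily, Mutation mutation) throws Exception {
            mutationMap
                .computeIfAbsent(rowKey, k -> new HashMap<>())
                .computeIfAbsent(columnFamily, k -> new ArrayList<>())
                .add(mutation);

            // Count Mutation objects, not rows, so bushy rows don't blow the heap.
            if (++pendingColumns >= COLUMN_THRESHOLD) {
                flush();
            }
        }

        public void flush() throws Exception {
            if (pendingColumns == 0) return;
            // 0.6-era Thrift call; adjust for your client/version.
            client.batch_mutate(keyspace, mutationMap, ConsistencyLevel.QUORUM);
            mutationMap.clear();
            pendingColumns = 0;
        }
    }

Call add() for every column you write and flush() once more at the end of the load.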
With this strategy the "bushiness" of your rows shouldn't be a problem.

Ben

On Tue, May 11, 2010 at 7:54 AM, David Boxenhorn <da...@lookin2.com> wrote:
> I am saving a large amount of data to Cassandra using batch mutate. I have
> found that my speed is proportional to the size of the batch. It was very
> slow when I was inserting one row at a time, but when I created batches of
> 100 rows and mutated them together, it went 100 times faster. (OK, I didn't
> measure it, but it was MUCH faster.)
>
> My problem is that my rows are of very varying degrees of bushiness (i.e.
> number of supercolumns and columns per row). I inserted 592,500 rows
> successfully, in a few minutes, and then I hit a batch of exceptionally
> bushy rows and ran out of memory.
>
> Does anyone have any suggestions about how to deal with this problem? I can
> make my algorithm smarter by taking into account the size of the rows and
> not just blindly do 100 at a time, but I want to solve this problem as
> generally as possible, and not depend on trial and error, or on the
> specific configuration of the machine I happen to be working on right now.
> I don't even know if the critical parameter is the total size of the
> values, or the number of columns, or what? Or maybe there's some optimal
> batch size, and that's what I should use always?
>
> Thanks.