Thanks a lot! 25,000 is a number I can work with. Any other suggestions?
On Tue, May 11, 2010 at 3:21 PM, Ben Browning <ben...@gmail.com> wrote:
> I like to base my batch sizes off of the total number of columns
> instead of the number of rows. This effectively means counting the
> number of Mutation objects in your mutation map and submitting the
> batch once it reaches a certain size. For my data, batch sizes of
> about 25,000 columns work best. You'll need to adjust this up or down
> depending on the size of your column names / values and available
> memory.
>
> With this strategy the "bushiness" of your rows shouldn't be a problem.
>
> Ben
>
>
> On Tue, May 11, 2010 at 7:54 AM, David Boxenhorn <da...@lookin2.com> wrote:
> > I am saving a large amount of data to Cassandra using batch mutate. I have
> > found that my speed is proportional to the size of the batch. It was very
> > slow when I was inserting one row at a time, but when I created batches of
> > 100 rows and mutated them together, it went 100 times faster. (OK, I didn't
> > measure it, but it was MUCH faster.)
> >
> > My problem is that my rows are of very varying degrees of bushiness (i.e.
> > number of supercolumns and columns per row). I inserted 592,500 rows
> > successfully, in a few minutes, and then I hit a batch of exceptionally
> > bushy rows and ran out of memory.
> >
> > Does anyone have any suggestions about how to deal with this problem? I can
> > make my algorithm smarter by taking into account the size of the rows and
> > not just blindly do 100 at a time, but I want to solve this problem as
> > generally as possible, and not depend on trial and error, and on the
> > specific configuration of the machine I happen to be working on right now. I
> > don't even know if the critical parameter is the total size of the values,
> > or the number of columns, or what? Or maybe there's some optimal batch size,
> > and that's what I should use always?
> >
> > Thanks.
> >
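For anyone finding this thread later, a rough Python sketch of the column-count batching Ben describes might look like the following. The names client, rows, and make_mutation are placeholders, not real API; the actual Thrift call is batch_mutate(mutation_map, consistency_level), where mutation_map is {row_key: {column_family: [Mutation, ...]}}.

from collections import defaultdict

# Flush threshold in Mutation objects (i.e. columns), not rows.
# Tune up or down for your column name/value sizes and available memory.
BATCH_COLUMN_LIMIT = 25000

def flush(client, mutation_map):
    """Submit the accumulated mutations and return a fresh, empty map."""
    if mutation_map:
        # Hypothetical client wrapper; the real Thrift call also takes a
        # consistency level.
        client.batch_mutate(mutation_map)
    return defaultdict(lambda: defaultdict(list))

def insert_rows(client, column_family, rows, make_mutation):
    """Insert rows of varying bushiness, flushing by column count.

    rows is an iterable of (row_key, columns) pairs, where columns is a
    dict of column name -> value; make_mutation builds one Thrift
    Mutation per column (both are assumptions for illustration).
    """
    mutation_map = defaultdict(lambda: defaultdict(list))
    pending_columns = 0

    for row_key, columns in rows:
        for name, value in columns.items():
            mutation_map[row_key][column_family].append(make_mutation(name, value))
            pending_columns += 1

            # Counting Mutation objects means one exceptionally bushy row
            # can't blow up memory the way a fixed 100-row batch can.
            if pending_columns >= BATCH_COLUMN_LIMIT:
                mutation_map = flush(client, mutation_map)
                pending_columns = 0

    flush(client, mutation_map)  # send whatever is left over

Note this can split a very bushy row across two batch_mutate calls, which is fine here since batch_mutate gives no cross-call atomicity anyway.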