The main thing is to test on your data: 25k works great for me, but if your values are substantially smaller or larger it might not work for you.
Not specific to batches, but if you have a decent-size cluster and have to
do lots of inserts, make sure your client is multi-threaded so that it's
not the bottleneck.

On Tue, May 11, 2010 at 8:31 AM, David Boxenhorn <da...@lookin2.com> wrote:
> Thanks a lot! 25,000 is a number I can work with.
>
> Any other suggestions?
>
> On Tue, May 11, 2010 at 3:21 PM, Ben Browning <ben...@gmail.com> wrote:
>>
>> I like to base my batch sizes off of the total number of columns
>> instead of the number of rows. This effectively means counting the
>> number of Mutation objects in your mutation map and submitting the
>> batch once it reaches a certain size. For my data, batch sizes of
>> about 25,000 columns work best. You'll need to adjust this up or down
>> depending on the size of your column names / values and available
>> memory.
>>
>> With this strategy the "bushiness" of your rows shouldn't be a problem.
>>
>> Ben
>>
>>
>> On Tue, May 11, 2010 at 7:54 AM, David Boxenhorn <da...@lookin2.com> wrote:
>> > I am saving a large amount of data to Cassandra using batch mutate.
>> > I have found that my speed is proportional to the size of the batch.
>> > It was very slow when I was inserting one row at a time, but when I
>> > created batches of 100 rows and mutated them together, it went 100
>> > times faster. (OK, I didn't measure it, but it was MUCH faster.)
>> >
>> > My problem is that my rows are of very varying degrees of bushiness
>> > (i.e. number of supercolumns and columns per row). I inserted
>> > 592,500 rows successfully, in a few minutes, and then I hit a batch
>> > of exceptionally bushy rows and ran out of memory.
>> >
>> > Does anyone have any suggestions about how to deal with this
>> > problem? I can make my algorithm smarter by taking into account the
>> > size of the rows and not just blindly do 100 at a time, but I want
>> > to solve this problem as generally as possible, and not depend on
>> > trial and error, and on the specific configuration of the machine I
>> > happen to be working on right now. I don't even know if the critical
>> > parameter is the total size of the values, or the number of columns,
>> > or what? Or maybe there's some optimal batch size, and that's what I
>> > should use always?
>> >
>> > Thanks.
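
Below is a minimal sketch of the column-counting strategy described in the
thread above: queue mutations, count columns rather than rows, and submit the
batch once the count crosses a threshold. The Mutation and BatchClient types
are hypothetical stand-ins for whatever Cassandra client API is actually in
use (raw Thrift, Hector, etc.), not a real API; only the flushing logic is the
point.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ColumnCountingBatcher {

        /** Placeholder for the client's mutation type (e.g. a Thrift Mutation). */
        public interface Mutation {}

        /** Placeholder for whatever batch_mutate call the client in use exposes. */
        public interface BatchClient {
            void batchMutate(Map<String, Map<String, List<Mutation>>> mutationMap);
        }

        private final BatchClient client;
        private final int maxColumnsPerBatch; // e.g. 25,000 -- tune for your data
        private final Map<String, Map<String, List<Mutation>>> mutationMap = new HashMap<>();
        private int pendingColumns = 0;

        public ColumnCountingBatcher(BatchClient client, int maxColumnsPerBatch) {
            this.client = client;
            this.maxColumnsPerBatch = maxColumnsPerBatch;
        }

        /** Queue one column mutation; flush automatically once the batch is "full". */
        public void add(String rowKey, String columnFamily, Mutation mutation) {
            mutationMap
                .computeIfAbsent(rowKey, k -> new HashMap<>())
                .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                .add(mutation);
            pendingColumns++; // count columns, not rows, so bushy rows don't blow up a batch
            if (pendingColumns >= maxColumnsPerBatch) {
                flush();
            }
        }

        /** Send whatever is queued and reset. Call once more after the last add(). */
        public void flush() {
            if (pendingColumns == 0) {
                return;
            }
            client.batchMutate(mutationMap);
            mutationMap.clear();
            pendingColumns = 0;
        }
    }

Each inserter thread can own its own batcher instance (so a single-threaded
client doesn't become the bottleneck on a larger cluster, as noted above), and
a final flush() sends whatever is left over. The 25,000 figure is just the
number quoted in the thread; as the reply says, test on your own data.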