The main thing is to test on your data - 25k columns per batch works
great for me, but if your column values are substantially smaller or
larger, a different batch size may work better for you.
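
For concreteness, here's roughly what that column-counting flush can
look like. This is a sketch only, not tested code: it assumes the
0.6-era Thrift API, and the threshold, keyspace handling, and class
name are placeholders to adapt. If your mutations wrap SuperColumns,
count their inner columns rather than the Mutation objects.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.Mutation;

public class ColumnCountingBatcher {
    // Flush once this many columns are queued; tune for your data.
    private static final int FLUSH_THRESHOLD = 25000;

    private final Cassandra.Client client;
    private final String keyspace;
    private final Map<String, Map<String, List<Mutation>>> mutationMap =
            new HashMap<String, Map<String, List<Mutation>>>();
    private int pendingColumns = 0;

    public ColumnCountingBatcher(Cassandra.Client client, String keyspace) {
        this.client = client;
        this.keyspace = keyspace;
    }

    // Queue one column for (rowKey, columnFamily) and flush when the
    // running count of queued mutations crosses the threshold.
    public void add(String rowKey, String columnFamily, Column column) throws Exception {
        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setColumn(column);
        Mutation mutation = new Mutation();
        mutation.setColumn_or_supercolumn(cosc);

        Map<String, List<Mutation>> byCf = mutationMap.get(rowKey);
        if (byCf == null) {
            byCf = new HashMap<String, List<Mutation>>();
            mutationMap.put(rowKey, byCf);
        }
        List<Mutation> mutations = byCf.get(columnFamily);
        if (mutations == null) {
            mutations = new ArrayList<Mutation>();
            byCf.put(columnFamily, mutations);
        }
        mutations.add(mutation);

        pendingColumns++;
        if (pendingColumns >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // Send whatever has been queued so far.
    public void flush() throws Exception {
        if (pendingColumns == 0) {
            return;
        }
        // 0.6-era signature; later versions drop the keyspace argument
        // (set_keyspace instead) and switch row keys to binary.
        client.batch_mutate(keyspace, mutationMap, ConsistencyLevel.QUORUM);
        mutationMap.clear();
        pendingColumns = 0;
    }
}

Call flush() one last time after your input loop so the final partial
batch isn't dropped.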

Not specific to batches, but if you have a decent-sized cluster and
have to do lots of inserts, make sure your client is multi-threaded so
that it isn't the bottleneck.
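
A rough shape for that, using a plain ExecutorService with one Thrift
connection per worker - the partitioning and insert helpers here are
just placeholders standing in for your own code:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoader {

    public static void main(String[] args) throws Exception {
        int workers = 8; // tune to your cluster and client hardware
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Hypothetical split of the input into one chunk per worker.
        List<List<String>> partitions = partitionRowKeys(workers);

        for (final List<String> partition : partitions) {
            pool.submit(new Runnable() {
                public void run() {
                    // Each worker opens its own connection and runs its
                    // own batching loop (e.g. the column-counting writer
                    // above); Thrift connections shouldn't be shared
                    // across threads.
                    insertPartition(partition);
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // Placeholders for your own data-splitting and insert code.
    static List<List<String>> partitionRowKeys(int workers) {
        throw new UnsupportedOperationException("split your input here");
    }

    static void insertPartition(List<String> rowKeys) {
        throw new UnsupportedOperationException("open a connection and insert here");
    }
}

Whether you split by key range or just round-robin the input doesn't
matter much, as long as each thread keeps its own connection and its
own batch.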

On Tue, May 11, 2010 at 8:31 AM, David Boxenhorn <da...@lookin2.com> wrote:
> Thanks a lot! 25,000 is a number I can work with.
>
> Any other suggestions?
>
> On Tue, May 11, 2010 at 3:21 PM, Ben Browning <ben...@gmail.com> wrote:
>>
>> I like to base my batch sizes off of the total number of columns
>> instead of the number of rows. This effectively means counting the
>> number of Mutation objects in your mutation map and submitting the
>> batch once it reaches a certain size. For my data, batch sizes of
>> about 25,000 columns work best. You'll need to adjust this up or down
>> depending on the size of your column names / values and available
>> memory.
>>
>> With this strategy the "bushiness" of your rows shouldn't be a problem.
>>
>> Ben
>>
>>
>> On Tue, May 11, 2010 at 7:54 AM, David Boxenhorn <da...@lookin2.com>
>> wrote:
>> > I am saving a large amount of data to Cassandra using batch mutate. I
>> > have
>> > found that my speed is proportional to the size of the batch. It was
>> > very
>> > slow when I was inserting one row at a time, but when I created batches
>> > of
>> > 100 rows and mutated them together, it went 100 times faster. (OK, I
>> > didn't
>> > measure it, but it was MUCH faster.)
>> >
>> > My problem is that my rows are of very varying degrees of bushiness
>> > (i.e.
>> > number of supercolumns and columns per row). I inserted 592,500 rows
>> > successfully, in a few minutes, and then I hit a batch of exceptionally
>> > bushy rows and ran out of memory.
>> >
>> > Does anyone have any suggestions about how to deal with this problem? I
>> > can
>> > make my algorithm smarter by taking into account the size of the rows
>> > and
>> > not just blindly do 100 at a time, but I want to solve this problem as
>> > generally as possible, and not depend on trial and error, and on the
>> > specific configuration of the machine I happen to be working on right
>> > now. I
>> > don't even know if the critical parameter is the total size of the
>> > values,
>> > or the number of columns, or what? Or maybe there's some optimal batch
>> > size,
>> > and that's what I should always use?
>> >
>> > Thanks.
>> >
>
>
