I am saving a large amount of data to Cassandra using batch_mutate. I have
found that my write speed is roughly proportional to the size of the batch. It was very
slow when I was inserting one row at a time, but when I created batches of
100 rows and mutated them together, it went 100 times faster. (OK, I didn't
measure it, but it was MUCH faster.)

My problem is that my rows vary widely in bushiness (i.e. the
number of supercolumns and columns per row). I inserted 592,500 rows
successfully, in a few minutes, and then I hit a batch of exceptionally
bushy rows and ran out of memory.

Does anyone have any suggestions about how to deal with this problem? I can
make my algorithm smarter by taking the size of the rows into account instead
of blindly doing 100 at a time, but I want to solve this problem as
generally as possible, without depending on trial and error or on the
specific configuration of the machine I happen to be working on right now. I
don't even know whether the critical parameter is the total size of the values,
or the number of columns, or something else. Or maybe there is some optimal
batch size that I should always use?
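
For concreteness, here is a rough Python sketch of the kind of "smarter"
batching I have in mind: cap each batch by an estimated payload size as well
as by a row count. The send_batch call and the
(row_key, {supercolumn: {column: value}}) row shape are just placeholders for
whatever the real client call and data layout are, not actual Cassandra API:

# Sketch of size-bounded batching instead of a fixed row count.
# Assumptions: rows arrive as (row_key, {supercolumn: {column: value}})
# tuples, and send_batch stands in for whatever client call actually
# performs the batch mutate; both are hypothetical placeholders.

MAX_BATCH_BYTES = 5 * 1024 * 1024   # tune to the heap / frame size
MAX_BATCH_ROWS = 100                # keep the old row cap as a secondary limit

def estimated_row_size(row_key, supercolumns):
    """Rough payload estimate: row key, column names, and values."""
    size = len(row_key)
    for sc_name, columns in supercolumns.items():
        size += len(sc_name)
        for col_name, value in columns.items():
            size += len(col_name) + len(value)
    return size

def batch_insert(rows, send_batch):
    """Group rows into batches bounded by both bytes and row count."""
    batch, batch_bytes = [], 0
    for row_key, supercolumns in rows:
        row_bytes = estimated_row_size(row_key, supercolumns)
        # Flush first if adding this row would blow past either limit.
        if batch and (batch_bytes + row_bytes > MAX_BATCH_BYTES
                      or len(batch) >= MAX_BATCH_ROWS):
            send_batch(batch)
            batch, batch_bytes = [], 0
        batch.append((row_key, supercolumns))
        batch_bytes += row_bytes
    if batch:
        send_batch(batch)

That still leaves the question of what the byte cap should actually be, which
is really what I am asking.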

Thanks.
