I am saving a large amount of data to Cassandra using batch mutate. I have found that my write speed is roughly proportional to the batch size: it was very slow when I was inserting one row at a time, but when I batched 100 rows together and mutated them in one call, it went about 100 times faster. (OK, I didn't measure it, but it was MUCH faster.)
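Roughly, what I'm doing now looks like this (simplified sketch; send_batch stands in for whatever call in my client actually performs the batch mutate, and a "row" here is just whatever I hand to that call):

    def insert_in_batches(rows, send_batch, batch_size=100):
        """Send rows to Cassandra in fixed-size batches."""
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:
                send_batch(batch)   # one round trip for up to batch_size rows
                batch = []
        if batch:                   # flush whatever is left over
            send_batch(batch)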
My problem is that my rows vary wildly in bushiness (i.e. the number of supercolumns and columns per row). I inserted 592,500 rows successfully in a few minutes, and then I hit a batch of exceptionally bushy rows and ran out of memory.

Does anyone have suggestions for dealing with this? I can make my algorithm smarter by taking the size of the rows into account instead of blindly doing 100 at a time (see the sketch below), but I would like to solve this as generally as possible, without relying on trial and error or on the specific configuration of the machine I happen to be working on right now. I don't even know what the critical parameter is: the total size of the values, the number of columns, or something else. Or maybe there's some optimal batch size that I should always use?
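The "smarter" version I have in mind would look something like this. The byte and column budgets here are pure guesses (which is exactly my problem), the row representation is flattened to (key, list of (name, value) pairs) for simplicity and ignores the supercolumn level, and send_batch is again a stand-in for the real batch mutate call:

    def estimate_row_size(key, columns):
        """Rough serialized size: key plus all column names and values."""
        return len(key) + sum(len(name) + len(value) for name, value in columns)

    def insert_by_size(rows, send_batch, max_bytes=1 << 20, max_columns=10_000):
        """Flush when the batch would exceed a byte budget or a column-count budget."""
        batch, batch_bytes, batch_columns = [], 0, 0
        for key, columns in rows:
            row_bytes = estimate_row_size(key, columns)
            row_columns = len(columns)
            # Flush before this row would push the batch over either budget.
            if batch and (batch_bytes + row_bytes > max_bytes
                          or batch_columns + row_columns > max_columns):
                send_batch(batch)
                batch, batch_bytes, batch_columns = [], 0, 0
            batch.append((key, columns))
            batch_bytes += row_bytes
            batch_columns += row_columns
        if batch:
            send_batch(batch)

Thanks.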