> in the family. There are millions of rows. Each operation consists of
> doing a batch_insert through pycassa, which increments ~17k keys. A
> majority of these keys are new in each batch.
> 
> Each operation is taking up to 15 seconds. For our system this is a
> significant bottleneck.
> 

Try splitting your batch into smaller pieces and launching them in parallel.
This way you may get better performance, because all cores are employed and
there is less copying/rebuilding of large structures inside thrift &
cassandra. I found that 1k rows per batch behaves better than 10k.
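
A rough sketch of what I mean, using a thread pool so each chunk gets its own
connection. The keyspace/column family names, server list, and the `mutations`
dict are made-up placeholders, not from the original post:

    from multiprocessing.dummy import Pool as ThreadPool
    import pycassa

    # Hypothetical data: ~17k rows, each mapping to a dict of columns.
    mutations = dict(('row-%d' % i, {'col': 'val'}) for i in range(17000))

    conn_pool = pycassa.ConnectionPool('MyKS', server_list=['node1:9160'],
                                       pool_size=8)  # one conn per worker
    cf = pycassa.ColumnFamily(conn_pool, 'Counts')

    BATCH_SIZE = 1000  # ~1k rows per batch behaved better than 10k for me

    def send_batch(pairs):
        # One mutator per chunk; everything goes out in a single batch_mutate.
        b = cf.batch(queue_size=BATCH_SIZE)
        for key, columns in pairs:
            b.insert(key, columns)
        b.send()

    items = list(mutations.items())
    batches = [items[i:i + BATCH_SIZE]
               for i in range(0, len(items), BATCH_SIZE)]

    workers = ThreadPool(8)  # threads work here: pycassa blocks on thrift I/O
    workers.map(send_batch, batches)
    workers.close()
    workers.join()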

It is also a good idea to split the batch into slices according to the
replication strategy and send each slice directly to its natural endpoint.
This reduces the inter-node communication that would otherwise be necessary
to forward mutations to their replicas.
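
A rough sketch of that splitting, assuming RandomPartitioner. It uses the
thrift describe_ring() call, which should be reachable through a raw pycassa
connection; the keyspace name and `mutations` are the same illustrative
placeholders as above:

    import hashlib
    import pycassa

    def token_for(key):
        # RandomPartitioner token: abs value of md5(key) read as a signed
        # 128-bit big-endian integer.
        n = int(hashlib.md5(key).hexdigest(), 16)
        return (1 << 128) - n if n >= (1 << 127) else n

    def primary_endpoint(ring, token):
        # Each TokenRange owns (start_token, end_token]; endpoints[0] is
        # the natural (first) replica for that range.
        for r in ring:
            start, end = int(r.start_token), int(r.end_token)
            if start < end:
                if start < token <= end:
                    return r.endpoints[0]
            elif token > start or token <= end:  # range wraps around the ring
                return r.endpoints[0]
        return ring[0].endpoints[0]

    pool = pycassa.ConnectionPool('MyKS', server_list=['node1:9160'])
    conn = pool.get()
    ring = conn.describe_ring('MyKS')  # list of thrift TokenRange structs
    pool.return_conn(conn)

    # Group row keys by their natural endpoint; each group can then be sent
    # through a ConnectionPool pinned to that node, reusing the batching
    # helper from the previous sketch.
    by_node = {}
    for key in mutations:
        node = primary_endpoint(ring, token_for(key))
        by_node.setdefault(node, []).append(key)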
