Re: loading big amount of data to Cassandra

Dimo Velev Fri, 02 Aug 2019 14:48:49 -0700

Hi,

Batches will actually slow the load down, because they mean something different 
in C*. As you read, they only group changes that you want executed atomically; 
they are not a way to speed up the transfer.

Cassandra does not really have indices, so it differs from a relational DB in 
that respect. However, writes to Cassandra end up as many smallish SSTable 
files on disk. These are then merged in the background (compaction) to improve 
read performance.

You have two options from my experience:

Option 1: use the normal CQL API in async mode. This will create a high CPU 
load on your cluster. If that is acceptable for you, it is probably the easiest 
solution.

Option 2: generate sstables locally and use sstableloader to upload them into 
the cluster. The streaming does not generate high CPU load, so it is a viable 
option for clusters that also carry other operational load.

Option 2 scales with the number of cores of the machine generating the 
sstables. If you can split your data you can generate sstables on multiple 
machines. In contrast, option 1 scales with your cluster. If you have a large 
cluster that is idling, it would be better to use option 1.
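
To make option 1 more concrete, here is a minimal sketch of an async load with 
a cap on in-flight requests. It assumes a session object from the DataStax 
Python driver (execute_async returning a future with add_callbacks); the 
function name load_async and the max_in_flight default are made up for 
illustration and are not tuned values.

```python
import threading


def load_async(session, statement, rows, max_in_flight=256):
    """Write rows via execute_async with at most max_in_flight pending requests.

    `session` is assumed to behave like a cassandra-driver Session and
    `statement` like a prepared INSERT; both are supplied by the caller.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    errors = []
    errors_lock = threading.Lock()

    def on_success(_result):
        slots.release()

    def on_error(exc):
        # Collect failures instead of aborting; the caller decides on retries.
        with errors_lock:
            errors.append(exc)
        slots.release()

    submitted = 0
    for row in rows:
        slots.acquire()  # blocks while the cluster is falling behind
        future = session.execute_async(statement, row)
        future.add_callbacks(on_success, on_error)
        submitted += 1

    # Drain: once every slot is back, all submitted requests have completed.
    for _ in range(max_in_flight):
        slots.acquire()
    return submitted, errors
```

With the real driver, the session would come from something like 
Cluster([...]).connect(...) and the statement from session.prepare(...) on 
your own INSERT. The cap keeps the client from queueing requests faster than 
the cluster can absorb them.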

With both options I was able to write at about 50-100K rows / sec on my laptop 
and local Cassandra. The speed heavily depends on the size of your rows.
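
For a rough feel of what that rate means for the roughly 7 billion records in 
the question, here is a small back-of-the-envelope calculation. It assumes the 
laptop throughput above would carry over unchanged, which it may well not.

```python
def load_hours(total_rows, rows_per_sec):
    """Hours needed to write total_rows at a constant rows_per_sec."""
    return total_rows / rows_per_sec / 3600


TOTAL_ROWS = 7_000_000_000  # figure taken from the original question

for rate in (50_000, 100_000):
    print(f"{rate} rows/sec -> {load_hours(TOTAL_ROWS, rate):.1f} hours")
# prints:
# 50000 rows/sec -> 38.9 hours
# 100000 rows/sec -> 19.4 hours
```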

Back to your question: I guess option 2 is similar to what you are used to from 
tools like sqlloader for relational DBMSes.

I had a requirement to load a few hundred million rows per day into an 
operational cluster, so I went with option 2 to move the CPU load off the 
cluster and reduce the impact on reads during the loads.
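
As a sketch of the plumbing around option 2: the helpers below split input rows 
into chunks (one per worker or machine) and build the sstableloader command for 
a finished directory. The helper names, hosts and paths are hypothetical. The 
sstable generation itself is not shown; it is usually done on the JVM, for 
example with Cassandra's CQLSSTableWriter class.

```python
import zlib


def chunk_for(partition_key, num_chunks):
    """Deterministically assign a partition key to one of num_chunks chunks.

    Hashing the key keeps all rows of a partition in the same chunk; any
    other split of the input would work as well.
    """
    return zlib.crc32(str(partition_key).encode("utf-8")) % num_chunks


def sstableloader_cmd(hosts, table_dir):
    """Build the sstableloader invocation for one generated directory.

    table_dir must end in <keyspace>/<table>, since sstableloader derives
    the target table from the directory names.
    """
    return ["sstableloader", "-d", ",".join(hosts), table_dir]


# Example: command for one generated directory (hosts and path are made up).
cmd = sstableloader_cmd(["10.0.0.1", "10.0.0.2"], "/data/out/my_ks/my_table")
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run on each machine once its 
chunk has been written out.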

Cheers,
Dimo


> On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
> 
> Hi,
> 
> I need to upload about 7 billion records to Cassandra. What is the best 
> setup of Cassandra for this task? Will using batches speed up the upload 
> (I've read somewhere that batches in Cassandra are meant for atomicity, not 
> for speeding up communication)? How does Cassandra work internally with 
> regard to indexing? In SQL databases, when uploading such an amount of data, 
> it is suggested to turn indexing off and then back on. Is something similar 
> possible in Cassandra?
> 
> Thanks for all suggestions.
> 
> Pat
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 
