Thanks to all,

I'll try the SSTables.

Thanks

Pat

On 2019-08-03 09:54, Dimo Velev wrote:
Check out the CQLSSTableWriter java class -
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
. You use it to generate sstables - you need to write a small program
for that. You can then stream them over the network using the
sstableloader (either use the utility or use the underlying classes to
embed it in your program).
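
A minimal sketch of that small program, assuming a simple example table
(keyspace "ks", table "events") -- the schema, insert statement, and output
directory are placeholders to adapt to your own data model:

```java
// Generate sstables offline with Cassandra's CQLSSTableWriter.
// Requires the cassandra-all jar on the classpath; no running cluster needed.
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.File;
import java.io.IOException;

public class SSTableGenerator {
    // Hypothetical example schema -- replace with your real table.
    static final String SCHEMA =
        "CREATE TABLE ks.events (id bigint PRIMARY KEY, payload text)";
    static final String INSERT =
        "INSERT INTO ks.events (id, payload) VALUES (?, ?)";

    public static void main(String[] args) throws IOException {
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(new File("/tmp/sstables/ks/events"))
                .forTable(SCHEMA)
                .using(INSERT)
                .build();
        for (long i = 0; i < 1_000_000; i++) {
            // Bind values in the same order as the ? markers in INSERT.
            writer.addRow(i, "row-" + i);
        }
        writer.close(); // flushes the final sstable to the directory
    }
}
```

The resulting directory can then be streamed into the cluster with, e.g.,
`sstableloader -d <contact-point> /tmp/sstables/ks/events`.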

On 3. Aug 2019, at 07:17, Ayub M <hia...@gmail.com> wrote:

Dimo, how do you generate the sstables? Do you mean loading the data
locally on a Cassandra node and then using sstableloader?

On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <dimo.ve...@gmail.com>
wrote:

Hi,

Batches will actually slow down the process because they mean a
different thing in C*: as you have read, they just group changes
together that you want executed atomically.

Cassandra does not really have indices in the relational sense, so
that is different from a relational DB. However, writes to Cassandra
produce many smallish sstables on disk. These are then compacted
together in the background to improve read performance.

You have two options from my experience:

Option 1: use the normal CQL API in async mode. This will create a
high CPU load on your cluster. If that is acceptable for you, this
might be the easiest solution.
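
A sketch of option 1, assuming the DataStax Java driver 3.x API (contact
point, keyspace, and table are illustrative placeholders). The key pattern
is capping the number of in-flight async writes with a semaphore so the
cluster is not overwhelmed:

```java
// Async bulk insert with a bounded number of in-flight requests.
// Requires the DataStax Java driver 3.x and Guava on the classpath.
import com.datastax.driver.core.*;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;

import java.util.concurrent.Semaphore;

public class AsyncLoader {
    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        PreparedStatement ps = session.prepare(
            "INSERT INTO ks.events (id, payload) VALUES (?, ?)");

        final int MAX_IN_FLIGHT = 256; // tune to your cluster
        final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);
        for (long i = 0; i < 1_000_000; i++) {
            inFlight.acquire(); // blocks when too many writes are pending
            ResultSetFuture f = session.executeAsync(ps.bind(i, "row-" + i));
            Futures.addCallback(f, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet rs) { inFlight.release(); }
                public void onFailure(Throwable t) {
                    inFlight.release();
                    t.printStackTrace(); // real code: retry or abort
                }
            }, MoreExecutors.directExecutor());
        }
        inFlight.acquire(MAX_IN_FLIGHT); // wait for all outstanding writes
        cluster.close();
    }
}
```

Using prepared statements and bounding concurrency are what make the async
path fast without flooding the coordinator nodes.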

Option 2: generate sstables locally and use the sstableloader to
upload them into the cluster. The streaming does not generate high
CPU load, so it is a viable option for clusters carrying other
operational load.

Option 2 scales with the number of cores of the machine generating
the sstables. If you can split your data you can generate sstables
on multiple machines. In contrast, option 1 scales with your
cluster. If you have a large cluster that is idling, it would be
better to use option 1.

With both options I was able to write at about 50-100K rows / sec
on my laptop and local Cassandra. The speed heavily depends on the
size of your rows.

Back to your question: I guess option 2 is similar to what you are
used to from tools like SQL*Loader for relational DBMSes.

I had a requirement of loading a few hundred million rows per day into
an operational cluster, so I went with option 2 to offload the CPU
load and reduce the impact on the read side during the loads.

Cheers,
Dimo

Sent from my iPad

On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:

Hi,

I need to upload about 7 billion records to Cassandra. What is the
best setup of Cassandra for this task? Will using batches speed up
the upload (I've read somewhere that batches in Cassandra are meant
for atomicity, not for speeding up communication)? How does Cassandra
handle indexing internally? In SQL databases, when uploading such an
amount of data, it is suggested to turn indexing off and then back
on. Is something similar possible in Cassandra?

Thanks for all suggestions.

Pat

----------------------------------------
Freehosting PIPNI - http://www.pipni.cz/





---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org






