Check out the CQLSSTableWriter Java class:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java

You use it to generate SSTables; you will need to write a small program for 
that. You can then stream them over the network with sstableloader (either 
use the utility directly or embed the underlying classes in your program).
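For illustration, here is a minimal sketch of such a generator program. The keyspace, table, and output path are made up, and it assumes the cassandra-all jar is on the classpath:

```java
import java.io.File;
import java.io.IOException;

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class SSTableGenerator {
    public static void main(String[] args) throws IOException {
        // Hypothetical schema -- substitute your own keyspace and table.
        String schema = "CREATE TABLE ks.events (id int PRIMARY KEY, payload text)";
        String insert = "INSERT INTO ks.events (id, payload) VALUES (?, ?)";

        // The output directory should follow the <keyspace>/<table> layout
        // that sstableloader expects.
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(new File("/tmp/sstables/ks/events"))
                .forTable(schema)
                .using(insert)
                .build();

        for (int i = 0; i < 1_000; i++) {
            writer.addRow(i, "row-" + i);  // values bound in statement order
        }
        writer.close();  // flushes the remaining data to disk
    }
}
```

The generated directory can then be streamed into the cluster with, e.g., `sstableloader -d <contact-point> /tmp/sstables/ks/events`.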


> On 3. Aug 2019, at 07:17, Ayub M <hia...@gmail.com> wrote:
> 
> Dimo, how do you generate SSTables? Do you mean load the data locally on a 
> Cassandra node and use sstableloader? 
> 
>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <dimo.ve...@gmail.com> wrote:
>> Hi,
>> 
>> Batches will actually slow down the process, because they mean something 
>> different in C*: as you have read, they just group changes that you want 
>> executed atomically. 
>> 
>> Cassandra does not really have indices in the relational sense, so that is 
>> different from a relational DB. However, writes initially produce many 
>> smallish SSTables on disk; these are then compacted together in the 
>> background to improve read performance.
>> 
>> You have two options from my experience:
>> 
>> Option 1: use the normal CQL API in async mode. This will create a high CPU 
>> load on your cluster. If that is acceptable for you, this is probably the 
>> easiest solution.
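The gist of option 1 is to keep many inserts in flight while capping concurrency so the cluster is not overrun. A sketch of that throttling pattern, with the driver call stubbed out (in a real loader, writeRow would be the driver's async execute of a prepared INSERT; the names here are made up):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncLoader {
    // Stub: a real loader would bind and execute a prepared INSERT here.
    static void writeRow(int id) { /* no-op */ }

    // Issue `rows` writes with at most `maxInFlight` pending at once;
    // returns the number of completed writes.
    static int load(int rows, int maxInFlight) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        Semaphore inFlight = new Semaphore(maxInFlight);
        AtomicInteger written = new AtomicInteger();

        for (int i = 0; i < rows; i++) {
            inFlight.acquire();                       // block once the cap is reached
            final int row = i;
            CompletableFuture
                .runAsync(() -> writeRow(row), pool)  // stands in for executeAsync(...)
                .whenComplete((v, err) -> {
                    written.incrementAndGet();
                    inFlight.release();               // free a slot for the next write
                });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return written.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rows written: " + load(10_000, 256));
    }
}
```

The semaphore is the important part: without back-pressure, an async loader can queue writes faster than the cluster absorbs them.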
>> 
>> Option 2: generate SSTables locally and use sstableloader to stream them 
>> into the cluster. The streaming does not generate a high CPU load, so it is 
>> a viable option for clusters carrying other operational load.
>> 
>> Option 2 scales with the number of cores of the machine generating the 
>> SSTables; if you can split your data, you can generate SSTables on multiple 
>> machines. In contrast, option 1 scales with your cluster. If you have a 
>> large cluster that is idling, it would be better to use option 1.
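Splitting the data for multi-machine generation can be as simple as carving the row id space into contiguous ranges, one per generator machine (a sketch; it assumes your data has some numeric key or file offset to split on):

```java
public class ChunkSplitter {
    // Split [0, total) into `parts` contiguous ranges, one per generator machine.
    static long[][] splitRanges(long total, int parts) {
        long[][] ranges = new long[parts][2];
        long base = total / parts;
        long rem = total % parts;
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < rem ? 1 : 0);  // spread the remainder evenly
            ranges[i] = new long[]{start, start + size};
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. 7 billion rows split across 4 generator machines
        for (long[] r : splitRanges(7_000_000_000L, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```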
>> 
>> With both options I was able to write at about 50-100K rows / sec on my 
>> laptop and local Cassandra. The speed heavily depends on the size of your 
>> rows.
>> 
>> Back to your question: I guess option 2 is similar to what you are used to 
>> from tools like SQL*Loader for relational DBMSes.
>> 
>> I had a requirement to load a few hundred million rows per day into an 
>> operational cluster, so I went with option 2 to offload the CPU work and 
>> reduce the impact on the read side during loads. 
>> 
>> Cheers,
>> Dimo
>> 
>> 
>> Sent from my iPad
>> 
>> > On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
>> > 
>> > Hi,
>> > 
>> > I need to upload about 7 billion records to Cassandra. What is the best 
>> > setup of Cassandra for this task? Will using batches speed up the upload 
>> > (I've read somewhere that batches in Cassandra are dedicated to 
>> > atomicity, not to speeding up communication)? How does Cassandra work 
>> > internally with regard to indexing? In SQL databases, when uploading such 
>> > an amount of data, it is suggested to turn indexing off and then back on. 
>> > Is something similar possible in Cassandra?
>> > 
>> > Thanks for all suggestions.
>> > 
>> > Pat
>> > 
>> > 
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> > For additional commands, e-mail: user-h...@cassandra.apache.org
>> > 
>> 
>> 
