Check out the CQLSSTableWriter Java class (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java). You use it to generate sstables, which means writing a small program around it. You can then stream them over the network with sstableloader (either run the utility or embed its underlying classes in your program).
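
For the streaming step, a minimal sketch of driving the sstableloader utility from a small Java program could look like the one below. It is only an illustration under a few assumptions: sstableloader is on the PATH, the generated sstables sit in a directory whose last two path components are the keyspace and table name, and the class name, method names, hosts, keyspace, and table are all made up for the example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Cassandra): builds and runs an sstableloader command line.
class SSTableLoaderCommand {

    // sstableloader takes keyspace and table from the last two path components,
    // so the sstables are expected under <base>/<keyspace>/<table>.
    static Path tableDirectory(Path base, String keyspace, String table) {
        return base.resolve(keyspace).resolve(table);
    }

    // Builds: sstableloader -d host1,host2 <tableDir>
    static List<String> build(List<String> contactPoints, Path tableDir) {
        if (contactPoints.isEmpty()) {
            throw new IllegalArgumentException("at least one contact point is required");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("sstableloader");                  // assumed to be on the PATH
        cmd.add("-d");                             // initial hosts to connect to
        cmd.add(String.join(",", contactPoints));
        cmd.add(tableDir.toString());
        return cmd;
    }

    // Starts the utility, forwards its output to this process, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Made-up locations and hosts, for illustration only.
        Path dir = tableDirectory(Paths.get("/tmp/sstables"), "my_keyspace", "my_table");
        List<String> cmd = build(Arrays.asList("10.0.0.1", "10.0.0.2"), dir);
        // Only print the command here; call run(cmd) to actually start streaming.
        System.out.println(String.join(" ", cmd));
    }
}
```

Running the class as is only prints the command; calling run(cmd) is what would start the actual streaming, so that part needs a reachable cluster and real sstables in the directory.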
> On 3. Aug 2019, at 07:17, Ayub M <hia...@gmail.com> wrote:
>
> Dimo, how do you generate sstables? Do you mean load data locally on a
> cassandra node and use sstableloader?
>
>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <dimo.ve...@gmail.com> wrote:
>> Hi,
>>
>> Batches will actually slow down the process because they mean a different
>> thing in C*: as you read, they just group changes together that you
>> want executed atomically.
>>
>> Cassandra does not really have indices, so that is different from a
>> relational DB. However, after writing stuff to Cassandra it generates many
>> smallish partitions of the data. These are then joined together in the
>> background to improve read performance.
>>
>> You have two options from my experience:
>>
>> Option 1: use the normal CQL API in async mode. This will create a high CPU
>> load on your cluster. Depending on whether that is fine for you, that might
>> be the easiest solution.
>>
>> Option 2: generate sstables locally and use the sstableloader to upload them
>> into the cluster. The streaming does not generate high CPU load, so it is a
>> viable option for clusters with other operational load.
>>
>> Option 2 scales with the number of cores of the machine generating the
>> sstables. If you can split your data, you can generate sstables on multiple
>> machines. In contrast, option 1 scales with your cluster. If you have a
>> large cluster that is idling, it would be better to use option 1.
>>
>> With both options I was able to write at about 50-100K rows / sec on my
>> laptop and local Cassandra. The speed heavily depends on the size of your
>> rows.
>>
>> Back to your question: I guess option 2 is similar to what you are used to
>> from tools like sqlloader for relational DBMSes.
>>
>> I had a requirement of loading a few 100 million rows per day into an
>> operational cluster, so I went with option 2 to offload the CPU load and
>> reduce the impact on the reading side during the loads.
>>
>> Cheers,
>> Dimo
>>
>>
>> Sent from my iPad
>>
>> > On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
>> >
>> > Hi,
>> >
>> > I need to upload about 7 billion records to Cassandra. What is the
>> > best setup of Cassandra for this task? Will using batches speed up the
>> > upload (I've read somewhere that batches in Cassandra are meant for
>> > atomicity, not for speeding up communication)? How does Cassandra work
>> > internally with regard to indexing? In SQL databases, when uploading such
>> > an amount of data, it is suggested to turn off indexing and then turn it
>> > back on. Is something similar possible in Cassandra?
>> >
>> > Thanks for all suggestions.
>> >
>> > Pat
>> >
>> > ----------------------------------------
>> > Freehosting PIPNI - http://www.pipni.cz/
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> > For additional commands, e-mail: user-h...@cassandra.apache.org