Re: loading big amount of data to Cassandra

Ayub M Fri, 02 Aug 2019 22:18:25 -0700

Dimo, how do you generate the SSTables? Do you mean loading the data locally
on a Cassandra node and then using sstableloader?


On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <dimo.ve...@gmail.com> wrote:

> Hi,
>
> Batches will actually slow down the process because they mean something
> different in C*. As you read, they just group changes together that you
> want executed atomically.
>
> Cassandra does not really have indices the way a relational DB does, so
> that part is different. However, as you write to Cassandra it flushes the
> data into many smallish SSTable files. These are then merged in the
> background (compaction) to improve read performance.
>
> You have two options from my experience:
>
> Option 1: use the normal CQL API in async mode. This will create a high CPU
> load on your cluster. If that is fine for you, it might be the easiest
> solution.
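
A minimal sketch of how the async approach in option 1 is usually throttled,
so that the loader does not flood the cluster with unbounded requests. It is
not tied to any driver: `write_row` is a hypothetical stand-in for whatever
async insert call your driver provides, and `max_in_flight=128` is an
arbitrary assumption to tune for your cluster.

```python
import asyncio


async def load_rows(rows, write_row, max_in_flight=128):
    """Write every row, keeping at most max_in_flight writes outstanding.

    write_row is a hypothetical coroutine function (row -> None) standing in
    for the driver's async insert. Returns a list of (row, exception) pairs
    for writes that failed, so the caller can retry them.
    """
    iterator = iter(rows)  # shared by all workers; rows are pulled lazily
    failed = []

    async def worker():
        # Each worker takes the next row as soon as its previous write ends,
        # so the number of concurrent writes never exceeds the worker count.
        for row in iterator:
            try:
                await write_row(row)
            except Exception as exc:  # collect failures instead of aborting
                failed.append((row, exc))

    await asyncio.gather(*(worker() for _ in range(max_in_flight)))
    return failed


if __name__ == "__main__":
    async def fake_write(row):
        # Placeholder for a real async INSERT; it only yields to the loop.
        await asyncio.sleep(0)

    failures = asyncio.run(load_rows(range(1000), fake_write, max_in_flight=16))
    print(f"{len(failures)} failed writes")
```

Because the rows are pulled lazily from a shared iterator, the input can be
a generator over a file and never has to fit in memory.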
>
> Option 2: generate SSTables locally and use sstableloader to upload them
> into the cluster. The streaming does not generate high CPU load, so it is
> a viable option for clusters with other operational load.
>
> Option 2 scales with the number of cores of the machine generating the
> SSTables. If you can split your data, you can generate SSTables on multiple
> machines. In contrast, option 1 scales with your cluster. If you have a
> large cluster that is idling, it would be better to use option 1.
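
One way to split the input for option 2, sketched under assumptions: rows
are tuples whose first element is the partition key, and the shard count
equals the number of cores or machines that will write SSTables. Any
disjoint split would do; hashing on the key is just one convenient choice.
The names `shard_for` and `split_rows` are made up for this illustration.
Writing the SSTables themselves is not shown (Cassandra ships a Java class,
CQLSSTableWriter, for that step).

```python
import zlib


def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (CRC32).

    Python's built-in hash() is randomized per process for strings, so it
    would give different shards on different machines.
    """
    return zlib.crc32(partition_key.encode("utf-8")) % num_shards


def split_rows(rows, num_shards, key_of=lambda row: row[0]):
    """Distribute rows into num_shards lists, one per SSTable-writing worker.

    This in-memory version is only for illustration; for a real load each
    shard would be a file that the corresponding worker reads.
    """
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(str(key_of(row)), num_shards)].append(row)
    return shards


if __name__ == "__main__":
    sample = [(f"user-{i % 50}", i) for i in range(1000)]
    for index, shard in enumerate(split_rows(sample, 4)):
        print(f"shard {index}: {len(shard)} rows")
```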
>
> With both options I was able to write about 50-100K rows/sec on my laptop
> against a local Cassandra. The speed heavily depends on the size of your
> rows.
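
For a sense of scale, here is the arithmetic for the 7 billion records from
the original question at the quoted 50-100K rows/sec. It assumes the rate is
sustained for the whole load, which a laptop figure may not reflect on a real
cluster. The result is roughly 19 to 39 hours.

```python
def load_hours(total_rows, rows_per_second):
    """Hours needed to load total_rows at a constant rows_per_second."""
    return total_rows / rows_per_second / 3600


TOTAL_ROWS = 7_000_000_000  # record count from the original question

for rate in (50_000, 100_000):  # throughput range quoted above
    print(f"{rate:>7} rows/s -> {load_hours(TOTAL_ROWS, rate):.1f} hours")
```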
>
> Back to your question: I guess option 2 is similar to what you are used to
> from tools like SQL*Loader for relational DBMSes.
>
> I had a requirement to load a few hundred million rows per day into an
> operational cluster, so I went with option 2 to move the CPU load off the
> cluster and reduce the impact on reads during the loads.
>
> Cheers,
> Dimo
>
>
> > On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
> >
> > Hi,
> >
> > I need to upload about 7 billion records to Cassandra. What is the best
> > Cassandra setup for this task? Will using batches speed up the upload
> > (I've read somewhere that batches in Cassandra are meant for atomicity,
> > not for speeding up communication)? How does Cassandra work internally
> > with regard to indexing? In SQL databases, when uploading such an amount
> > of data, it is suggested to turn indexing off and then back on
> > afterwards. Is something similar possible in Cassandra?
> >
> > Thanks for all suggestions.
> >
> > Pat
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
