Hello all!

There are ~500 GB of CSV files, and I am trying to find a way to load them
into a C* table (a new, empty C* cluster of 3 nodes, replication factor 2)
within a reasonable time (say, 10 hours using 3-4 c3.8xlarge EC2 instances).

My first impulse was to use CQLSSTableWriter, but a single instance of it
is too slow, and I can't efficiently parallelize it (by just creating Java
threads) because after some point it always "hangs" (it looks like the GC
is overstressed) and eats all available memory.
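
For reference, here is roughly the pattern I'm using (a simplified sketch;
the keyspace/table names, schema, and paths are placeholders, and the real
rows of course come from parsing the CSV files):

    import java.io.File;
    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    public class CsvToSSTable {
        public static void main(String[] args) throws Exception {
            String schema = "CREATE TABLE ks.mytable ("
                          + "  id bigint PRIMARY KEY,"
                          + "  payload text)";
            String insert = "INSERT INTO ks.mytable (id, payload) VALUES (?, ?)";

            // Writes SSTables to a local directory; they are streamed into
            // the cluster afterwards with sstableloader.
            CQLSSTableWriter writer = CQLSSTableWriter.builder()
                    .inDirectory(new File("/tmp/sstables/ks/mytable"))
                    .forTable(schema)
                    .using(insert)
                    .build();

            // Placeholder loop standing in for the actual CSV parsing.
            for (long i = 0; i < 1_000_000; i++) {
                writer.addRow(i, "row-" + i);
            }
            writer.close();
        }
    }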

So the questions are:
1. What is the best way to bulk-load a huge amount of data into a new C* cluster?

This comment on https://issues.apache.org/jira/browse/CASSANDRA-9323:

> The preferred way to bulk load is now COPY; see CASSANDRA-11053
> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked tickets

is confusing, because I have read that CQLSSTableWriter + sstableloader is
much faster than COPY. Who is right?

2. Are there any real examples of multi-threaded use of CQLSSTableWriter?
Or maybe ready-to-use libraries like https://github.com/spotify/hdfs2cass?
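
What I have in mind is something like one writer per thread, each with its
own output directory, since CQLSSTableWriter is not thread-safe (a sketch;
whether this actually avoids the GC problems I'm seeing is exactly my
question):

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    public class ParallelWriters {
        static final String SCHEMA =
            "CREATE TABLE ks.mytable (id bigint PRIMARY KEY, payload text)";
        static final String INSERT =
            "INSERT INTO ks.mytable (id, payload) VALUES (?, ?)";

        public static void main(String[] args) throws Exception {
            int threads = 8;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                final int shard = t;
                pool.submit(() -> {
                    // One writer and one output directory per thread,
                    // because CQLSSTableWriter is not thread-safe.
                    File dir = new File("/tmp/sstables/shard-" + shard
                            + "/ks/mytable");
                    dir.mkdirs();
                    try (CQLSSTableWriter writer = CQLSSTableWriter.builder()
                            .inDirectory(dir)
                            .forTable(SCHEMA)
                            .using(INSERT)
                            .build()) {
                        // Each thread would process its own subset of the
                        // CSV files here; this loop is a placeholder.
                        for (long i = shard; i < 1_000_000; i += threads) {
                            writer.addRow(i, "row-" + i);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }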

3. sstableloader is slow too. Given that I have a new, empty C* cluster,
how can I improve the upload speed? Could I, say, disable replication or
change some other settings while streaming and then turn them back
afterwards?
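
To make that concrete, this is the kind of thing I mean (a hypothetical
sketch using the DataStax Java driver; the keyspace name and contact point
are placeholders, and I don't know whether it is actually safe or
effective):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class ToggleReplication {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")  // placeholder seed node
                    .build();
                 Session session = cluster.connect()) {

                // Drop to RF 1 before streaming...
                session.execute("ALTER KEYSPACE ks WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

                // ... run sstableloader here ...

                // ... then restore RF 2 once the load is done.
                session.execute("ALTER KEYSPACE ks WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 2}");
                // 'nodetool repair ks' would then have to run on each node
                // to build the second replica.
            }
        }
    }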

Thanks!
Artur.
