Hello all! I have ~500 GB of CSV files, and I am trying to find a way to upload them to a C* table (a new, empty C* cluster of 3 nodes, replication factor 2) within a reasonable time (say, 10 hours, which works out to roughly 14 MB/s sustained, using 3-4 c3.8xlarge EC2 instances).
My first impulse was to use CQLSSTableWriter, but a single instance of it is too slow, and I can't efficiently parallelize it by just creating Java threads, because after some point it always "hangs" (it looks like the GC is overstressed) and eats all available memory. So my questions are:

1. What is the best way to bulk-load a huge amount of data into a new C* cluster? This comment on https://issues.apache.org/jira/browse/CASSANDRA-9323, "The preferred way to bulk load is now COPY; see CASSANDRA-11053 <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked tickets", confuses me, because I have read that CQLSSTableWriter + sstableloader is much faster than COPY. Which is right? (The COPY invocation I would benchmark is in the P.P.S. below.)

2. Are there any real examples of multi-threaded use of CQLSSTableWriter? Or maybe ready-to-use libraries like https://github.com/spotify/hdfs2cass? (A stripped-down sketch of what I'm currently trying is in the P.S. below.)

3. sstableloader is slow too. Given that I have a new, empty C* cluster, how can I improve the upload speed? Maybe disable replication or relax some other settings while streaming, and then turn them back afterwards?

Thanks!
Artur.
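P.S. For reference, here is a stripped-down sketch of the multi-threaded approach I'm attempting: one CQLSSTableWriter per thread (the class documents itself as not thread-safe), each writing to its own output directory so the resulting SSTables don't clash, with each directory fed to sstableloader afterwards. The schema, paths, and CSV parsing are simplified placeholders, and details may vary by C* version:

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.BufferedReader;
import java.io.File;
import java.nio.file.Files;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSSTableWrite
{
    // Hypothetical schema; the real table is wider.
    static final String SCHEMA =
        "CREATE TABLE ks.events (id text PRIMARY KEY, payload text)";
    static final String INSERT =
        "INSERT INTO ks.events (id, payload) VALUES (?, ?)";

    public static void main(String[] args) throws Exception
    {
        // args[0] = directory holding pre-split CSV shards, one per worker
        File[] shards = new File(args[0]).listFiles((d, n) -> n.endsWith(".csv"));
        ExecutorService pool = Executors.newFixedThreadPool(shards.length);
        for (File shard : shards)
            pool.submit(() -> writeShard(shard));
        pool.shutdown();
    }

    static void writeShard(File csv)
    {
        // One writer per thread, since CQLSSTableWriter is not thread-safe.
        // Separate output dirs keep the writers' SSTable files from
        // colliding; each dir is loaded with sstableloader afterwards.
        File outDir = new File("/tmp/sstables/" + csv.getName());
        outDir.mkdirs();
        try (CQLSSTableWriter writer = CQLSSTableWriter.builder()
                                                       .inDirectory(outDir)
                                                       .forTable(SCHEMA)
                                                       .using(INSERT)
                                                       .withBufferSizeInMB(128) // cap per-writer heap use
                                                       .build();
             BufferedReader in = Files.newBufferedReader(csv.toPath()))
        {
            String line;
            while ((line = in.readLine()) != null)
            {
                // Naive CSV split; a real parser would handle quoting etc.
                String[] cols = line.split(",", 2);
                writer.addRow(cols[0], cols[1]);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

Even with one writer per thread, each writer buffers rows on the heap before flushing, so the thread count times the buffer size has to fit in the JVM heap; maybe that is the source of my GC trouble.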
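P.P.S. And this is roughly the COPY invocation I would benchmark it against, using the tuning options from the CASSANDRA-11053 rewrite (the keyspace, table, column list, path, and option values are placeholders; I don't know what settings are optimal):

cqlsh -e "COPY ks.events (id, payload) FROM '/data/csv/*.csv'
          WITH NUMPROCESSES = 16 AND CHUNKSIZE = 5000 AND INGESTRATE = 100000;"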