When I tested cqlsh COPY FROM for CASSANDRA-11053 <https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15162800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15162800>, I was able to import about 20 GB in under 4 minutes on an 8-node cluster, using the same benchmark that was created for cassandra-loader, provided the driver was Cythonized (instructions are in this blog post <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>). The performance was similar to that of cassandra-loader.
Depending on your schema, one or the other may do slightly better.

On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla <r...@foundev.pro> wrote:

> I suggest using cassandra loader
>
> https://github.com/brianmhess/cassandra-loader
>
> On Mar 9, 2017 5:30 PM, "Artur R" <ar...@gpnxgroup.com> wrote:
>
>> Hello all!
>>
>> There are ~500gb of CSV files and I am trying to find the way how to
>> upload them to C* table (new empty C* cluster of 3 nodes, replication
>> factor 2) within reasonable time (say, 10 hours using 3-4 instance of
>> c3.8xlarge EC2 nodes).
>>
>> My first impulse was to use CQLSSTableWriter, but it is too slow is of
>> single instance and I can't efficiently parallelize it (just creating Java
>> threads) because after some moment it always "hangs" (looks like GC is
>> overstressed) and eats all available memory.
>>
>> So the questions are:
>> 1. What is the best way to bulk-load huge amount of data to new C*
>> cluster?
>>
>> This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:
>>
>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>>> tickets
>>
>>
>> is confusing because I read that the CQLSSTableWriter + sstableloader is
>> much faster than COPY. Who is right?
>>
>> 2. Is there any real examples of multi-threaded using of CQLSSTableWriter?
>> Maybe ready to use libraries like: https://github.com/spotify/hdfs2cass?
>>
>> 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
>> how can I improve the upload speed? Maybe disable replication or some other
>> settings while streaming and then turn it back?
>>
>> Thanks!
>> Artur.
>>
>

--
<http://www.datastax.com/>
STEFANIA ALBORGHETTI
Software engineer | +852 6114 9265 | stefania.alborghe...@datastax.com