Thank you all! It turns out that the fastest ways are https://github.com/brianmhess/cassandra-loader and COPY FROM.
So I decided to stick with COPY FROM as it is built-in and easy to use.

On Fri, Mar 10, 2017 at 2:22 PM, Ahmed Eljami <ahmed.elj...@gmail.com> wrote:

> Hi,
>
> > 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
> > how can I improve the upload speed? Maybe disable replication or some
> > other settings while streaming and then turn it back?
>
> Maybe you can accelerate your load with the option -cph (connections per
> host): https://issues.apache.org/jira/browse/CASSANDRA-3668 and -t=1000
>
> With cph=12 and t=1000, I went from 56 min (default values) to 11 min for
> a table of 50 GB.
>
> 2017-03-10 2:09 GMT+01:00 Stefania Alborghetti
> <stefania.alborghetti@datastax.com>:
>
>> When I tested cqlsh COPY FROM for CASSANDRA-11053
>> <https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15162800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15162800>,
>> I was able to import about 20 GB in under 4 minutes on a cluster with 8
>> nodes using the same benchmark created for cassandra-loader, provided the
>> driver was Cythonized; instructions are in this blog post
>> <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>.
>> The performance was similar to cassandra-loader.
>>
>> Depending on your schema, one or the other may do slightly better.
>>
>> On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla <r...@foundev.pro> wrote:
>>
>>> I suggest using cassandra-loader:
>>>
>>> https://github.com/brianmhess/cassandra-loader
>>>
>>> On Mar 9, 2017 5:30 PM, "Artur R" <ar...@gpnxgroup.com> wrote:
>>>
>>>> Hello all!
>>>>
>>>> There are ~500 GB of CSV files and I am trying to find a way to
>>>> upload them to a C* table (new empty C* cluster of 3 nodes, replication
>>>> factor 2) within a reasonable time (say, 10 hours using 3-4 instances of
>>>> c3.8xlarge EC2 nodes).
>>>>
>>>> My first impulse was to use CQLSSTableWriter, but it is too slow on a
>>>> single instance and I can't efficiently parallelize it (just creating
>>>> Java threads) because after some time it always "hangs" (it looks like
>>>> GC is overstressed) and eats all available memory.
>>>>
>>>> So the questions are:
>>>>
>>>> 1. What is the best way to bulk-load a huge amount of data into a new
>>>> C* cluster?
>>>>
>>>> This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323
>>>>
>>>>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>>>>> tickets
>>>>
>>>> is confusing because I read that CQLSSTableWriter + sstableloader
>>>> is much faster than COPY. Who is right?
>>>>
>>>> 2. Are there any real examples of multi-threaded use of
>>>> CQLSSTableWriter? Maybe ready-to-use libraries like
>>>> https://github.com/spotify/hdfs2cass ?
>>>>
>>>> 3. sstableloader is slow too. Assuming that I have a new empty C*
>>>> cluster, how can I improve the upload speed? Maybe disable replication
>>>> or some other settings while streaming and then turn them back on?
>>>>
>>>> Thanks!
>>>> Artur.
>>>>
>>
>> --
>> STEFANIA ALBORGHETTI
>> Software engineer | +852 6114 9265 | stefania.alborghe...@datastax.com
>> <http://www.datastax.com/>
>
> --
> Cordialement;
>
> Ahmed ELJAMI
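For anyone finding this thread later: below is a rough sketch of how the COPY FROM approach chosen at the top of this message could be scripted per CSV file. It only builds the cqlsh command line and does not run it. The helper name `build_copy_from_command`, the keyspace `demo_ks`, the table `events`, the column names, the CSV path and the host are all made up for illustration. NUMPROCESSES and CHUNKSIZE are COPY FROM options, but the values used here are placeholders, not tuned recommendations.

```python
import shlex


def build_copy_from_command(keyspace, table, columns, csv_path,
                            host="127.0.0.1", num_processes=8,
                            chunk_size=5000, header=True):
    """Return an argv list that runs one cqlsh COPY FROM statement.

    All names passed in are hypothetical examples; the option values are
    placeholders and should be tuned for the actual cluster and schema.
    """
    column_list = ", ".join(columns)
    statement = (
        f"COPY {keyspace}.{table} ({column_list}) FROM '{csv_path}' "
        f"WITH HEADER = {'true' if header else 'false'} "
        f"AND NUMPROCESSES = {num_processes} "
        f"AND CHUNKSIZE = {chunk_size}"
    )
    # cqlsh takes the host as a positional argument and -e to execute
    # a single statement and exit.
    return ["cqlsh", host, "-e", statement]


if __name__ == "__main__":
    argv = build_copy_from_command(
        "demo_ks", "events", ["id", "ts", "payload"], "/data/part-0001.csv"
    )
    # Print a shell-quoted command line that could be pasted into a terminal.
    print(shlex.join(argv))
```

With ~500 GB split over many CSV files, one such command per file could be launched from several client machines in parallel; whether that actually helps depends on the schema and the cluster, so it is worth measuring on a small sample first.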