Hi,

> 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
> how can I improve the upload speed? Maybe disable replication or some other
> settings while streaming and then turn it back?

Maybe you can speed up your load with the -cph (connections per host) option,
see https://issues.apache.org/jira/browse/CASSANDRA-3668, together with -t=1000
(throttle).

With cph=12 and t=1000, I went from 56 min (default values) to 11 min for a
table of 50 GB.

2017-03-10 2:09 GMT+01:00 Stefania Alborghetti <[email protected]>:

> When I tested cqlsh COPY FROM for CASSANDRA-11053
> <https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15162800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15162800>,
> I was able to import about 20 GB in under 4 minutes on a cluster with 8
> nodes using the same benchmark created for cassandra-loader, provided the
> driver was Cythonized, instructions in this blog post
> <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>.
> The performance was similar to cassandra-loader.
>
> Depending on your schema, one or the other may do slightly better.
>
> On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla <[email protected]> wrote:
>
>> I suggest using cassandra loader
>>
>> https://github.com/brianmhess/cassandra-loader
>>
>> On Mar 9, 2017 5:30 PM, "Artur R" <[email protected]> wrote:
>>
>>> Hello all!
>>>
>>> There are ~500 GB of CSV files and I am trying to find a way to
>>> upload them to a C* table (new empty C* cluster of 3 nodes, replication
>>> factor 2) within a reasonable time (say, 10 hours using 3-4 instances of
>>> c3.8xlarge EC2 nodes).
>>>
>>> My first impulse was to use CQLSSTableWriter, but it is too slow on a
>>> single instance and I can't efficiently parallelize it (just creating Java
>>> threads) because after some moment it always "hangs" (looks like GC is
>>> overstressed) and eats all available memory.
>>>
>>> So the questions are:
>>>
>>> 1. What is the best way to bulk-load a huge amount of data to a new C*
>>> cluster?
>>>
>>> This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:
>>>
>>>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>>>> tickets
>>>
>>> is confusing because I read that CQLSSTableWriter + sstableloader is
>>> much faster than COPY. Who is right?
>>>
>>> 2. Are there any real examples of multi-threaded use of
>>> CQLSSTableWriter?
>>> Maybe ready-to-use libraries like https://github.com/spotify/hdfs2cass?
>>>
>>> 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
>>> how can I improve the upload speed? Maybe disable replication or some other
>>> settings while streaming and then turn it back?
>>>
>>> Thanks!
>>> Artur.
>
> --
> Stefania Alborghetti
> Software engineer

--
Regards,
Ahmed ELJAMI
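To make the -cph / -t suggestion concrete, here is a minimal Python sketch. It only builds the sstableloader argument list and scales the timings quoted above to the ~500 GB from the original question. The helper names, host addresses, and SSTable directory are hypothetical; only the `-d`, `-cph`, and `-t` flags come from sstableloader itself. The projection assumes load time grows linearly with data size, which may not hold on different hardware or schemas.

```python
# Hypothetical helpers for illustration; nothing here is part of Cassandra.
# Only the sstableloader flags -d (initial hosts), -cph (connections per
# host) and -t (throttle) are real options.

def sstableloader_command(hosts, sstable_dir, connections_per_host=12, throttle=1000):
    """Build the argument list for one sstableloader run (does not execute it)."""
    return [
        "sstableloader",
        "-d", ",".join(hosts),               # comma-separated contact points
        "-cph", str(connections_per_host),   # value used in the reply: 12
        "-t", str(throttle),                 # value used in the reply: 1000
        sstable_dir,
    ]


def projected_hours(total_gb, sample_gb, sample_minutes):
    """Linearly scale a measured load time to a larger data set (an assumption)."""
    return total_gb / sample_gb * sample_minutes / 60.0


if __name__ == "__main__":
    # Placeholder hosts and path; replace with real values before use.
    cmd = sstableloader_command(
        ["10.0.0.1", "10.0.0.2", "10.0.0.3"], "/data/my_keyspace/my_table"
    )
    print(" ".join(cmd))
    # 50 GB took 56 min with defaults and 11 min with cph=12, t=1000.
    print(round(projected_hours(500, 50, 56), 1))  # 9.3 hours at the default rate
    print(round(projected_hours(500, 50, 11), 1))  # 1.8 hours at the tuned rate
```

Under that linear assumption, the default settings would sit close to the 10-hour budget, while the tuned settings would leave a wide margin. The list returned by `sstableloader_command` could be passed to `subprocess.run` once the placeholders are filled in.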
