Re: HELP with bulk loading

Artur R Tue, 14 Mar 2017 04:31:38 -0700

Thank you all!
It turns out that the fastest options are
https://github.com/brianmhess/cassandra-loader and COPY FROM.


So I decided to stick with COPY FROM, as it is built-in and easy to use.
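
A minimal sketch (Python) of how a COPY FROM import could be scripted, for
anyone who wants to drive it over many CSV files. The keyspace, table, column
names, CSV path, host address, and option values below are all hypothetical
and only for illustration. HEADER, NUMPROCESSES, and CHUNKSIZE are standard
COPY FROM options, but good values depend on the schema and the cluster, so
treat them as placeholders to tune.

```python
def build_copy_from(keyspace, table, columns, csv_path, **options):
    """Build a cqlsh COPY ... FROM statement as a string.

    Options are passed as keyword arguments and rendered as
    "NAME = value" pairs joined by AND, in the order given.
    """
    def render(value):
        # cqlsh expects lowercase true/false for boolean options.
        return str(value).lower() if isinstance(value, bool) else str(value)

    stmt = "COPY {}.{} ({}) FROM '{}'".format(
        keyspace, table, ", ".join(columns), csv_path)
    if options:
        rendered = " AND ".join(
            "{} = {}".format(name.upper(), render(value))
            for name, value in options.items())
        stmt += " WITH " + rendered
    return stmt


def cqlsh_command(host, statement):
    """Return the argv list that runs one statement through cqlsh -e."""
    return ["cqlsh", host, "-e", statement]


# Hypothetical names and values, purely for illustration.
statement = build_copy_from(
    "ks", "events", ["id", "ts", "payload"], "/data/part-0001.csv",
    header=True, numprocesses=8, chunksize=5000)
command = cqlsh_command("10.0.0.1", statement)
print(statement)
# To actually run the import: subprocess.run(command, check=True)
```

The sketch only builds the command; running one such command per CSV file
(or per group of files) from several client machines is one way to spread the
load, assuming the client hosts and the cluster can keep up.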

On Fri, Mar 10, 2017 at 2:22 PM, Ahmed Eljami <ahmed.elj...@gmail.com>
wrote:

> Hi,
>
> >3. sstableloader is slow too. Assuming that I have a new empty C* cluster,
> >how can I improve the upload speed? Maybe disable replication or some other
> >settings while streaming and then turn them back on?
>
> Maybe you can accelerate your load with the option -cph (connections per
> host): https://issues.apache.org/jira/browse/CASSANDRA-3668 and -t=1000
>
> With cph=12 and t=1000, I went from 56 min (default values) to 11 min for a
> table of 50 GB.
>
>
>
> 2017-03-10 2:09 GMT+01:00 Stefania Alborghetti
> <stefania.alborghetti@datastax.com>:
>
>> When I tested cqlsh COPY FROM for CASSANDRA-11053
>> <https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15162800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15162800>,
>> I was able to import about 20 GB in under 4 minutes on a cluster with 8
>> nodes using the same benchmark created for cassandra-loader, provided the
>> driver was Cythonized; instructions are in this blog post
>> <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>.
>> The performance was similar to cassandra-loader.
>>
>> Depending on your schema, one or the other may do slightly better.
>>
>> On Fri, Mar 10, 2017 at 8:11 AM, Ryan Svihla <r...@foundev.pro> wrote:
>>
>>> I suggest using cassandra loader
>>>
>>> https://github.com/brianmhess/cassandra-loader
>>>
>>> On Mar 9, 2017 5:30 PM, "Artur R" <ar...@gpnxgroup.com> wrote:
>>>
>>>> Hello all!
>>>>
>>>> There are ~500 GB of CSV files and I am trying to find a way to
>>>> upload them to a C* table (new empty C* cluster of 3 nodes, replication
>>>> factor 2) within a reasonable time (say, 10 hours using 3-4 c3.8xlarge
>>>> EC2 instances).
>>>>
>>>> My first impulse was to use CQLSSTableWriter, but it is too slow with a
>>>> single instance, and I can't efficiently parallelize it (just by creating
>>>> Java threads) because after a while it always "hangs" (it looks like GC
>>>> is overstressed) and eats all available memory.
>>>>
>>>> So the questions are:
>>>> 1. What is the best way to bulk-load a huge amount of data into a new C*
>>>> cluster?
>>>>
>>>> This comment here (https://issues.apache.org/jira/browse/CASSANDRA-9323):
>>>>
>>>>> The preferred way to bulk load is now COPY; see CASSANDRA-11053
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked
>>>>> tickets
>>>>
>>>>
>>>> is confusing because I read that CQLSSTableWriter + sstableloader
>>>> is much faster than COPY. Who is right?
>>>>
>>>> 2. Are there any real examples of multi-threaded use of
>>>> CQLSSTableWriter?
>>>> Maybe ready-to-use libraries like https://github.com/spotify/hdfs2cass?
>>>>
>>>> 3. sstableloader is slow too. Assuming that I have a new empty C*
>>>> cluster, how can I improve the upload speed? Maybe disable replication or
>>>> some other settings while streaming and then turn them back on?
>>>>
>>>> Thanks!
>>>> Artur.
>>>>
>>>
>>
>>
>> --
>>
>> <http://www.datastax.com/>
>>
>> STEFANIA ALBORGHETTI
>>
>> Software engineer | +852 6114 9265 |
>> stefania.alborghe...@datastax.com
>>
>>
>>
>>
>
>
> --
> Regards,
>
> Ahmed ELJAMI
>
