Hi Bhuvan

Support for large datasets in COPY FROM was added by CASSANDRA-11053
<https://issues.apache.org/jira/browse/CASSANDRA-11053>, which is available
in 2.1.14, 2.2.6, 3.0.5 and 3.5. Your scenario is valid with this patch
applied.

The 3.0.x and 3.x releases are already available, whilst the other two
releases are due in the next few days. You only need to install an
up-to-date release on the machine where COPY FROM is running.

You may find the setup instructions in this blog
<http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>
interesting. Specifically, for large datasets, I would highly recommend
installing the Python driver with C extensions, as it will speed things up
considerably. Again, this is only possible with the 11053 patch. Please
ignore the suggestion to also compile the cqlsh copy module itself with C
extensions (Cython), as you may hit CASSANDRA-11574
<https://issues.apache.org/jira/browse/CASSANDRA-11574> in the 3.0.5 and
3.5 releases.
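
As a rough sketch of that setup (the package names below are for a
Debian/Ubuntu machine and are illustrative; adjust for your environment):

    # Build prerequisites for the driver's C extensions
    sudo apt-get install gcc python-dev libev4 libev-dev

    # The DataStax Python driver builds its C extensions automatically
    # when a compiler and libev are available at install time
    pip install cassandra-driver

If the C extensions were built, importing cassandra.io.libevreactor in a
Python shell should succeed on that machine.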

Before CASSANDRA-11053, the parent process was a bottleneck. This is
explained further in this blog
<http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from>,
second paragraph of the "worker processes" section. As a workaround, if you
are unable to upgrade, you may try reducing the INGESTRATE and introducing
a few extra worker processes via NUMPROCESSES. Note also that an overloaded
parent process cannot report progress correctly, so a frozen progress
report does not necessarily mean the COPY operation has stalled.
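
For example, a COPY FROM along these lines (keyspace, table, file name and
values are placeholders; the defaults are INGESTRATE=100000 and one worker
process per core, so tune for your hardware):

    -- Throttle ingest and add worker processes to relieve the parent
    COPY mykeyspace.mytable FROM 'data.csv'
      WITH INGESTRATE = 50000 AND NUMPROCESSES = 8 AND HEADER = true;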

Do let us know if you still have problems, as this is new functionality.

With best regards,
Stefania


On Sat, Apr 23, 2016 at 6:34 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Hi,
>
> I'm trying to copy a 20 GB CSV file into a fresh 3-node Cassandra cluster
> with 32 GB memory each, sufficient disk, RF=1, and durable writes
> disabled. The machine I'm feeding from is external to the cluster, shares
> a 1 Gbps line, and has 16 GB RAM. (We chose this setup to reduce CPU and
> IO usage on the cluster.)
>
> I'm trying to use the COPY command to feed in the data. It kicks off
> well, launches a set of processes, and does about 50,000 rows per second.
> But I can see that the parent process keeps accumulating memory, almost
> the size of the data processed, and after a point the processes just
> hang. The parent process was consuming 95% of system memory when it had
> processed around 60% of the data.
>
> I had earlier tried to feed in data from multiple files (less than 4 GB
> each) and it worked as expected.
>
> Is it a valid scenario?
>
> Regards,
> Bhuvan
>



-- 


Stefania Alborghetti

Apache Cassandra Software Engineer

|+852 6114 9265| stefania.alborghe...@datastax.com

