Hi Ryan,

Thanks for your reply. Now I understand how SSTableLoader works.

   - If I understand correctly, the current o.a.c.io.sstable.SSTableLoader
   doesn't use LOCAL_ONE or LOCAL_QUORUM. Is that right?
   - Is it possible to modify SSTableLoader so that it only streams to one
   data center?
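
For the second question: I noticed the sstableloader tool has an
-i/--ignore option for nodes it should not stream to. Would something
like this (node names are made up) keep the streaming inside the
analytics data center, at the cost of leaving the REST data center's
replicas empty until a repair runs?

    sstableloader -d analytics-node1,analytics-node2 \
        -i rest-node1,rest-node2 \
        /path/to/myks/mytable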

Because I may load ~100 million rows, I think spark-cassandra-connector
might be too slow. I'm wondering if the "copy the sstables / nodetool
refresh" method described in
http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be
a good choice. I'm still a newbie to Cassandra and could not fully
follow what the author says on that page. One of my questions is (a
sketch of the procedure follows the list below):

* When I run a Spark job in YARN mode, the sstables are created in the
YARN working directory.
* Assume I have a way to copy those files into the Cassandra data
directory on the same node.
* Because the data are distributed across all of the analytics data
center's nodes, each node has only a part of the sstables: node A has
part A, node B has part B. If I run refresh on each node, eventually
node A has parts A and B, and node B has parts A and B too. Am I right?
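
For concreteness, this is the kind of procedure I have in mind on each
analytics node (all paths, keyspace, and table names below are made up):

    # Copy every sstable component (Data, Index, Statistics, ...) from
    # the YARN working directory into the table's live data directory.
    cp /path/to/yarn/workdir/myks/mytable/* \
        /var/lib/cassandra/data/myks/mytable/

    # Tell Cassandra to load the newly placed sstables without a restart.
    nodetool refresh myks mytable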
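
Also, to make sure I understood your earlier suggestion about writing
through the spark-cassandra-connector at LOCAL_QUORUM, is this roughly
what you meant? (Just a sketch; the host, keyspace, table, and column
names are made up.)

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Point the connector at the analytics data center and require only
    // local replicas to acknowledge each write.
    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "analytics-node1")
      .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
    val sc = new SparkContext(conf)

    // Writes still replicate to the REST data center in the background,
    // but the job does not wait for the remote replicas to respond.
    sc.parallelize(Seq((1, "a"), (2, "b")))
      .saveToCassandra("myks", "mytable", SomeColumns("id", "value"))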

Thanks.

On Thu, Jan 8, 2015 at 6:34 AM, Ryan Svihla <r...@foundev.pro> wrote:

> Just noticed you'd sent this to the dev list. This is a question for
> the user list only; please do not send questions of this type to the
> developer list.
>
> On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla <r...@foundev.pro> wrote:
>
> > The nature of replication factor is such that writes will go wherever
> > there is replication. If you want responses to be faster, and do not
> > want the spark job's response to involve the REST data center, I
> > suggest using a cql driver with the LOCAL_ONE or LOCAL_QUORUM
> > consistency level (look at the spark cassandra connector here:
> > https://github.com/datastax/spark-cassandra-connector ). While write
> > traffic will still be replicated to the REST service data center,
> > because you do want those results available, you will not be waiting
> > on the remote data center to respond "successful".
> >
> > Final point: bulk loading sends a copy per replica across the wire.
> > Let's say you have RF 3 in each data center; that means bulk loading
> > will send out 6 copies from that client at once, whereas with normal
> > mutations via thrift or cql, writes go out as 1 copy between data
> > centers and that node then forwards them on to the other replicas.
> > This means cross-data-center traffic in this case would be 3x more
> > with the bulk loader than with a traditional cql or thrift based
> > client.
> >
> >
> >
> > On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang <bewang.t...@gmail.com> wrote:
> >
> >> I set up two virtual data centers, one for analytics and one for REST
> >> service. The analytics data center sits on top of the Hadoop cluster.
> >> I want to bulk load my ETL results into the analytics data center so
> >> that the REST service won't take the heavy load. I'm using
> >> CQLTableInputFormat in my Spark application, and I gave the nodes in
> >> the analytics data center as the initial addresses.
> >>
> >> However, I found my jobs were connecting to the REST service data
> >> center.
> >>
> >> How can I specify the data center?
> >>
> >
> >
> >
> > --
> >
> > Thanks,
> > Ryan Svihla
> >
> >
>
>
> --
>
> Thanks,
> Ryan Svihla
>
