Hi Ryan,

Thanks for your reply. Now I understand how SSTableLoader works.
- If I understand correctly, the current o.a.c.io.sstable.SSTableLoader
  doesn't use LOCAL_ONE or LOCAL_QUORUM. Is that right?
- Is it possible to modify SSTableLoader so that it only talks to one
  data center?

Because I may load ~100 million rows, I think spark-cassandra-connector
might be too slow (I sketched how I would use it with LOCAL_QUORUM in the
P.S. below). I'm wondering whether the "Copy the sstables / nodetool
refresh" method described in
http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be
a good choice. I'm still a newbie to Cassandra, and I could not fully
understand what the author said on that page.

One of my questions is:

* When I run a Spark job in YARN mode, the sstables are created in the
  YARN working directory.
* Assume I have a way to copy those files into the Cassandra data
  directory on the same node.
* Because the data are distributed across all of the analytics data
  center's nodes, each node holds only part of the sstables: node A has
  part A, node B has part B. If I run "nodetool refresh" on each node,
  will node A eventually hold parts A and B, and node B hold parts A and
  B too? Am I right? (I sketched the flow I have in mind in the P.P.S.
  below.)

Thanks.

On Thu, Jan 8, 2015 at 6:34 AM, Ryan Svihla <r...@foundev.pro> wrote:

> Just noticed you'd sent this to the dev list; this is a question for
> only the user list. Please do not send questions of this type to the
> developer list.
>
> On Thu, Jan 8, 2015 at 8:33 AM, Ryan Svihla <r...@foundev.pro> wrote:
>
> > The nature of replication factor is such that writes will go wherever
> > there is replication. If you want responses to be faster, and you
> > don't want the REST data center involved in the spark job's response
> > path, I suggest using a CQL driver with LOCAL_ONE or LOCAL_QUORUM
> > consistency level (look at the spark cassandra connector here:
> > https://github.com/datastax/spark-cassandra-connector ). While write
> > traffic will still be replicated to the REST service data center,
> > because you do want those results available there, you will not be
> > waiting on the remote data center to respond "successful".
> >
> > Final point: bulk loading sends a copy per replica across the wire.
> > So let's say you have RF=3 in each data center; bulk loading will
> > then send out 6 copies from the client at once, whereas normal
> > mutations via thrift or CQL go out between data centers as 1 copy,
> > and that node then forwards them on to the other replicas. This means
> > inter-data-center traffic in this case would be 3x more with the bulk
> > loader than with a traditional CQL or thrift based client.
> >
> > On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang <bewang.t...@gmail.com>
> > wrote:
> >
> >> I set up two virtual data centers, one for analytics and one for the
> >> REST service. The analytics data center sits on top of a Hadoop
> >> cluster. I want to bulk load my ETL results into the analytics data
> >> center so that the REST service won't take the heavy load. I'm using
> >> CQLTableInputFormat in my Spark application, and I gave the nodes in
> >> the analytics data center as the initial addresses.
> >>
> >> However, I found my jobs were connecting to the REST service data
> >> center.
> >>
> >> How can I specify the data center?
> >
> > --
> >
> > Thanks,
> > Ryan Svihla
>
> --
>
> Thanks,
> Ryan Svihla
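P.S. To make sure I understand your LOCAL_QUORUM suggestion, here is
roughly how I would do the write with the spark-cassandra-connector.
This is just my sketch as a newbie: the keyspace, table, column, and
host names are made up, and I am only assuming that
spark.cassandra.connection.local_dc is the right setting to keep the
coordinators in the analytics data center.

    import com.datastax.driver.core.ConsistencyLevel
    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.WriteConf
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("etl-bulk-load")
      // contact point in the analytics DC (made-up address)
      .set("spark.cassandra.connection.host", "10.0.0.10")
      // keep coordinators in the analytics DC (my assumption)
      .set("spark.cassandra.connection.local_dc", "analytics")
    val sc = new SparkContext(conf)

    // stand-in for my real ETL output
    val results = sc.parallelize(Seq((1L, "a"), (2L, "b")))

    // LOCAL_QUORUM: acked by replicas in the local DC only, so the job
    // does not wait on the REST DC, though writes still replicate there
    results.saveToCassandra(
      "my_keyspace", "my_table",
      SomeColumns("id", "value"),
      writeConf = WriteConf(consistencyLevel = ConsistencyLevel.LOCAL_QUORUM))

Does that match what you had in mind?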
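P.P.S. And here is the "copy the sstables / nodetool refresh" flow from
the Pythian post as I understand it, so you can tell me where I go
wrong. Both paths are hypothetical; I only know my job writes its
sstables somewhere under the YARN working directory.

    # on each node in the analytics DC, after the Spark job finishes:
    # copy every sstable component file into the live data directory
    cp /data/yarn/work/output/my_keyspace/my_table/* \
       /var/lib/cassandra/data/my_keyspace/my_table/

    # ask Cassandra to pick up the new sstables without a restart
    nodetool refresh my_keyspace my_table

My worry is the last bullet in my message above: whether refresh alone
ever redistributes the parts between nodes, or whether each node only
ever serves the sstables it was given.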