On Fri, Jan 9, 2015 at 3:55 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Fri, Jan 9, 2015 at 11:38 AM, Benyi Wang <bewang.t...@gmail.com> wrote:
>
>> - Is it possible to modify SSTableLoader to allow it to access one data
>> center?
>
> Even if you only write to nodes in DC A, if you replicate that data to DC
> B, it will have to travel over the WAN anyway? What are you trying to
> avoid?

I'm lucky that those are virtual data centers in a LAN. I just don't want a
load burst in the "service" virtual data center, because it may degrade the
REST service. I'm trying to load data into the "analytics" virtual data
center, then let Cassandra "slowly" replicate the data into the "service"
virtual data center. It is OK for the REST service to read some stale data
while replication is in progress. I'm wondering if I should just use
"Throttle speed in Mbits" to solve my problem?

>> Because I may load ~100 million rows, I think spark-cassandra-connector
>> might be too slow. I'm wondering if the "Copy-the-sstables / 'nodetool
>> refresh'" method described in
>> http://www.pythian.com/blog/bulk-loading-options-for-cassandra/ would be
>> a good choice. I'm still a newbie to Cassandra, and I could not follow
>> what the author said on that page.
>
> The author of that post is as wise as he is modest... ;D
>
>> One of my questions is:
>>
>> * When I run a Spark job in YARN mode, the SSTables are created in the
>> YARN working directory.
>> * Assume I have a way to copy those files into the Cassandra data
>> directory on the same node.
>> * Because the data are distributed across all of the analytics data
>> center's nodes, each one has only a part of the SSTables: node A has
>> part A, node B has part B. If I run refresh on each node, eventually
>> node A will have parts A and B, and node B will have parts A and B too.
>> Am I right?
>
> I'm not sure I fully understand your question, but...
>
> In order to run refresh without having to immediately run cleanup, you
> need to have SSTables which contain data only for ranges owned by the
> node you are loading them onto.
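[As a rough sketch of the throttling and refresh knobs mentioned above. The host, keyspace, table, and path names are placeholders, and exact flags can vary by Cassandra version:]

```shell
# Throttle sstableloader's outbound streaming (value in megabits/s) so a
# bulk load into the analytics DC doesn't saturate the network:
sstableloader -d analytics-node1 --throttle 50 /path/to/my_keyspace/my_table/

# A running node's streaming throughput can also be capped:
nodetool setstreamthroughput 50

# The copy-the-sstables alternative: after placing SSTable files in the
# table's data directory on a node, load them without a restart:
nodetool refresh my_keyspace my_table
```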
> So for an RF=3, N=3 cluster without vnodes (the simple case), data is
> naturally on every node.
>
> For an RF=3, N=6 cluster A B C D E F, node C contains:
>
> - The third replica for A.
> - The second replica for B.
> - The first replica for C.
>
> In order for you to generate the correct SSTables, you need to understand
> all 3 replicas that should be there. With vnodes, and with nodes joining
> and parting, this becomes more difficult.
>
> That's why people tend to use SSTableLoader and the streaming interface:
> with SSTableLoader, Cassandra takes input which might live on any replica
> and sends it to the appropriate nodes.
>
> =Rob
> http://twitter.com/rcolidba

I'd better stay with SSTableLoader. Thanks for your explanation.
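[Rob's RF=3, N=6 layout can be checked with a quick shell sketch of SimpleStrategy-style placement, assuming no vnodes: each range's replicas are its owning node plus the next RF-1 nodes clockwise around the ring. The A-F ring is his toy example:]

```shell
nodes=(A B C D E F)
rf=3
c_holds=""   # primary ranges whose replica set includes node C
for i in "${!nodes[@]}"; do
  for ((r = 0; r < rf; r++)); do
    # replica r of range i lives r steps clockwise from the owner
    replica=${nodes[$(( (i + r) % ${#nodes[@]} ))]}
    if [ "$replica" = "C" ]; then
      c_holds="$c_holds${nodes[$i]}"
    fi
  done
done
echo "node C holds replicas for ranges: $c_holds"
# -> node C holds replicas for ranges: ABC
```

Node C ends up holding data for three different primary ranges (A, B, and C), so per-node SSTables generated offline would need all three — which is why streaming through SSTableLoader, which routes each row to the right replicas, is the simpler path.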