1. Yes, you can absolutely benefit from data locality, and the InputSplits will theoretically schedule the map task on Cassandra+Hadoop nodes that have the data locally. If your application doesn't need to worry about that one pesky row that should be local to that node (the node is responsible for it, but for some reason the data isn't there), then go ahead and run it with CL ONE. In a perfect world all of the rows would be there, but any seasoned Cassandra user knows that exceptions happen.
If what Bryan says is right, then in your first MR job the mapper would miss that row, but a subsequent run would see it, because the first read would have triggered a read repair in the background. Once again, how important is it that you get all of your data 100% of the time?

2. I would think a little more about your project if you are planning on using Hadoop only for data locality. It depends on whether your workload would benefit from Hadoop and distributed processing. Hadoop provides many benefits, but if you require QUORUM consistency and your workload doesn't lend itself to an input > output distributed workload, then Hadoop might not be the right tool for the job.

Best,
Michael

On 10/18/12 1:52 PM, "Andrey Ilinykh" <ailin...@gmail.com> wrote:

>On Thu, Oct 18, 2012 at 1:34 PM, Michael Kjellman
><mkjell...@barracuda.com> wrote:
>> Not sure I understand your question (if there is one..)
>>
>> You are more than welcome to do CL ONE and assuming you have hadoop nodes
>> in the right places on your ring things could work out very nicely. If you
>> need to guarantee that you have all the data in your job then you'll need
>> to use QUORUM.
>>
>> If you don't specify a CL in your job config it will default to ONE (at
>> least that's what my read of the ConfigHelper source for 1.1.6 shows)
>>
>I have two questions.
>1. I can benefit from data locality (and Hadoop) only with CL ONE. Is
>it correct?
>2. With CL QUORUM cassandra reads data from all replicas. In this case
>Hadoop doesn't give me any benefits. Application running outside the
>cluster has the same performance. Is it correct?
>
>Thank you,
> Andrey
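
P.S. To put rough numbers on the ONE versus QUORUM trade-off discussed in this thread, here is a minimal Python sketch. It only does the replica arithmetic; it does not talk to Cassandra or Hadoop. The function names are made up for illustration, and it assumes the map task runs on a node that holds one replica of the rows it reads. Note that QUORUM waits for a majority of replicas (RF / 2 + 1 with integer division), not all of them, so some of the locality benefit may survive even at QUORUM.

```python
# Illustrative arithmetic only; these helpers are hypothetical and are not
# part of Cassandra's or Hadoop's API.

def quorum(replication_factor: int) -> int:
    """Replicas that must answer a QUORUM read: a majority of RF."""
    return replication_factor // 2 + 1


def replicas_required(consistency_level: str, replication_factor: int) -> int:
    """Number of replica responses a read waits for at the given CL."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def remote_replicas_required(consistency_level: str, replication_factor: int) -> int:
    """Responses that must come over the network.

    Assumption: the map task is scheduled on a node that owns one replica,
    so one of the required responses can be served locally.
    """
    return max(replicas_required(consistency_level, replication_factor) - 1, 0)


if __name__ == "__main__":
    rf = 3  # example replication factor, chosen for illustration
    for cl in ("ONE", "QUORUM", "ALL"):
        print(cl, replicas_required(cl, rf), remote_replicas_required(cl, rf))
```

Under those assumptions, with RF = 3 a CL ONE read can be satisfied entirely by the local replica, while a QUORUM read needs at least one other node to respond, which is where the extra network cost comes from.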