On Oct 18, 2012, at 3:52 PM, Andrey Ilinykh <ailin...@gmail.com> wrote:

> On Thu, Oct 18, 2012 at 1:34 PM, Michael Kjellman
> <mkjell...@barracuda.com> wrote:
>> Not sure I understand your question (if there is one..)
>> 
>> You are more than welcome to do CL ONE and assuming you have hadoop nodes
>> in the right places on your ring things could work out very nicely. If you
>> need to guarantee that you have all the data in your job then you'll need
>> to use QUORUM.
>> 
>> If you don't specify a CL in your job config it will default to ONE (at
>> least that's what my read of the ConfigHelper source for 1.1.6 shows)
>> 
> I have two questions.
> 1. I can benefit from data locality (and Hadoop) only with CL ONE. Is
> it correct?

Yes, and at QUORUM it's quasi-local.  The job tracker finds out which replicas 
own a range and sends the task to one of them, so the read is local.  With 
CL.QUORUM (see the Read Path section of 
http://wiki.apache.org/cassandra/ArchitectureInternals), the actual data read 
happens on the closest node, which is the local one.  The other replicas needed 
for the quorum return only a digest, which is used to verify that they hold the 
same data.  So with RF=3 and QUORUM, the data is read on the local node where 
the task is running, and the next-closest replica is asked for a digest to 
verify that it is consistent.  The digest does cross the wire, with the latency 
that implies, but the data itself is not sent.
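
To make the data-versus-digest distinction concrete, here is a rough sketch in 
Python.  It is not Cassandra's code: the function names and the sample row are 
made up for illustration, and MD5 is used only as a stand-in hash.  It shows 
the local replica supplying the full row while the remote replica contributes 
only a fixed-size digest.

```python
import hashlib


def digest(row_bytes: bytes) -> bytes:
    """Hash a serialized row (hypothetical stand-in for a replica's digest reply)."""
    return hashlib.md5(row_bytes).digest()


def quorum_read(local_row: bytes, remote_digests: list) -> tuple:
    """Return the locally read row and whether every remote digest matches it."""
    local_digest = digest(local_row)
    consistent = all(d == local_digest for d in remote_digests)
    return local_row, consistent


# RF=3, QUORUM=2: one local data read plus one remote digest.
row = b"key1:col1=value1"            # made-up row contents
remote = [digest(row)]               # what the next-closest replica would send back
data, ok = quorum_read(row, remote)
bytes_over_wire = sum(len(d) for d in remote)
print(ok, bytes_over_wire)
```

The digest stays the same size however large the row is, which is why the 
network cost of the consistency check stays small.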

> 2. With CL QUORUM cassandra reads data from all replicas. In this case
> Hadoop doesn't give me any  benefits. Application running outside the
> cluster has the same performance. Is it correct?

CL QUORUM does not read data from all replicas.  An application running outside 
the cluster has to copy the data out of the cluster, which is far more copy- 
and network-intensive than using CL.QUORUM with the built-in Hadoop support.
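
For reference, the quorum size is (RF / 2) + 1 with integer division.  A small 
sketch of the arithmetic (the function name is mine, not from the Cassandra 
source):

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at CL QUORUM."""
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5):
    q = quorum(rf)
    # One replica serves the data; the other q - 1 serve digests;
    # the remaining rf - q are not needed to satisfy the read.
    print(f"RF={rf}: quorum={q}, digest-only={q - 1}, not needed={rf - q}")
```

With RF=3 that is two replicas: one data read and one digest, leaving the third 
replica out of the read.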

> 
> Thank you,
>  Andrey
