On Oct 18, 2012, at 3:52 PM, Andrey Ilinykh <ailin...@gmail.com> wrote:
> On Thu, Oct 18, 2012 at 1:34 PM, Michael Kjellman
> <mkjell...@barracuda.com> wrote:
>> Not sure I understand your question (if there is one..)
>>
>> You are more than welcome to do CL ONE and assuming you have hadoop nodes
>> in the right places on your ring things could work out very nicely. If you
>> need to guarantee that you have all the data in your job then you'll need
>> to use QUORUM.
>>
>> If you don't specify a CL in your job config it will default to ONE (at
>> least that's what my read of the ConfigHelper source for 1.1.6 shows)
>>
> I have two questions.
> 1. I can benefit from data locality (and Hadoop) only with CL ONE. Is
> it correct?

Yes, and at QUORUM it's quasi-local. The job tracker finds out where a
range is and sends a task to a replica with the data (local). In the case
of CL.QUORUM (see the Read Path section of
http://wiki.apache.org/cassandra/ArchitectureInternals), it will do an
actual read of the data on the closest node (local). Then it will get a
digest from other nodes to verify that they have the same data. So in the
case of RF=3 and QUORUM, it will read the data on the local node where the
task is running and will check the next closest replica for a digest to
verify that it is consistent. The digest is sent across the wire, with the
latency that implies, but the data itself is not sent.

> 2. With CL QUORUM cassandra reads data from all replicas. In this case
> Hadoop doesn't give me any benefits. Application running outside the
> cluster has the same performance. Is it correct?

CL QUORUM does not read data from all replicas. Applications running
outside the cluster have to copy the data from the cluster, a much more
copy/network intensive operation than using CL.QUORUM with the built-in
Hadoop support.

>
> Thank you,
> Andrey
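
To make the RF=3 / QUORUM case from the reply concrete, here is a minimal Python sketch of which replicas get a full data read and which only get a digest request. It is a simplified model, not Cassandra code: the function names and the replica labels are hypothetical, only CL ONE and QUORUM are modeled, and read repair is ignored.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must answer at CL QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def plan_read(replicas_by_proximity, consistency_level):
    """Split the contacted replicas into one data read plus digest reads.

    replicas_by_proximity: replica labels, closest first (hypothetical input).
    consistency_level: "ONE" or "QUORUM" (the only levels modeled here).
    """
    if consistency_level == "ONE":
        required = 1
    elif consistency_level == "QUORUM":
        required = quorum(len(replicas_by_proximity))
    else:
        raise ValueError("only ONE and QUORUM are modeled in this sketch")

    contacted = replicas_by_proximity[:required]
    # The closest replica serves the actual data; the rest only send a digest.
    return {"data": contacted[0], "digest": contacted[1:]}


# RF=3, with the Hadoop task running on the node labeled "local".
replicas = ["local", "next-closest", "farthest"]
plan_one = plan_read(replicas, "ONE")
plan_quorum = plan_read(replicas, "QUORUM")
print(plan_one)
print(plan_quorum)
```

Under these assumptions, QUORUM with RF=3 contacts two of the three replicas: the data read stays on the local node and only one digest request goes over the network.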