> My high-level understanding of how Cassandra handles a SELECT is that :
> (excuse incorrect terminology)
> 1. client connects to some node N
> 2. node N acts as a kind of coordinator and fires off the thrift or
> binary-protocol messages
> to all other nodes to fetch rows off the memtables and/or disks

The internode messages are a custom binary protocol, not the thrift / native API messages. These messages are also used on the node to move your request into the appropriate thread pool.
Each node reads the data needed for the request as if it were the only node performing the request. The only time we act differently is when sending the data back to the coordinator.

> 3. coordinator merges, truncates, etc the sets from the nodes and
> returns one answer set to client.

The coordinator simply compares the results from the replicas and determines whether they match. It does not merge or truncate. If they do not match, we perform the read again, but this time transmit some extra data so we can resolve the differences.

> It is step 3 which has me wondering - does it explicitly preserve the
> on-disk order?

Order from the on-disk read (including reverse ordering requested in the SELECT statement) is preserved in the serialisation process. After that we never order again.

> In fact - does it simply keep each individual node's answer set separate?
> Is that how it works?

I did some recent webinars for PlanetCassandra that may help:

Introduction to Apache Cassandra 1.2
http://thelastpickle.com/speaking/2013/04/25/Community-Webinar.html
Talks about the read / write and cluster process at a high level.

Cassandra Internals
http://thelastpickle.com/speaking/2013/08/25/Cassandra-Community-Webinar.html
Goes deep into the code to explain how Cassandra works.

Hope that helps.

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 13/09/2013, at 1:11 AM, John Lumby <johnlu...@hotmail.com> wrote:

> Aaron, thanks for the super-rapid response. That clarifies a lot for me,
> but I think I am still wondering about one point embedded below.
>
> ________________________________
>> From: aa...@thelastpickle.com
>> Subject: Re: is the select result grouped by the value of the partition key?
>> Date: Thu, 12 Sep 2013 14:19:06 +1200
>> To: user@cassandra.apache.org
>>
>> GROUP BY "feature",
>> I would not think of it like that, this is about physical order of rows.
>>
>> since it seems really important yet does not seem to be mentioned in the
>> CQL reference documentation.
>> It's baked in, this is how the data is organised on the row.
>
> Yes, I see, and I absolutely get the relevance of where columns are
> stored on disk to,
> say, doing INSERTs.
> But what I am wondering about is, in the context of a SELECT, we seem to
> be relying on
> the Cassandra client api preserving that on-disk order while returning rows.
> My high-level understanding of how Cassandra handles a SELECT is that :
> (excuse incorrect terminology)
> 1. client connects to some node N
> 2. node N acts as a kind of coordinator and fires off the thrift or
> binary-protocol messages
> to all other nodes to fetch rows off the memtables and/or disks
> 3. coordinator merges, truncates, etc the sets from the nodes and
> returns one answer set to client.
>
> It is step 3 which has me wondering - does it explicitly preserve the
> on-disk order?
> In fact - does it simply keep each individual node's answer set separate?
> Is that how it works?
>
>>
>> http://www.datastax.com/dev/blog/thrift-to-cql3
>> We often say the PRIMARY KEY is the PARTITION KEY and the GROUPING COLUMNS
>> http://www.datastax.com/documentation/cql/3.0/webhelp/index.html#cql/cql_reference/create_table_r.html
>>
>> See also http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
>>
>> Is it something we can bet the farm and farmer's family on?
>> Sure.
>>
>> The kinds of scenarios where I am wondering if it's possible for
>> partition-key groups
>> to get intermingled are :
>> All instances of the table entity with the same value(s) for the
>> PARTITION KEY portion of the PRIMARY KEY existing in the same storage
>> engine row.
>>
>> . what if the node containing primary copy of a row is down
>> There is no primary copy of a row.
>>
>> . what if there is a heavy stream of UPDATE activity from
>> applications which
>> connect to all nodes, causing different nodes to have different
>> versions of replicas of same row?
>> That's fine with me.
>> It's only an issue when the data is read, and at that point the
>> Consistency Level determines what we do.
>>
>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> New Zealand
>> @aaronmorton
>>
>> Co-Founder & Principal Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> On 12/09/2013, at 7:43 AM, John Lumby
>> <johnlu...@hotmail.com> wrote:
>>
>> I would like to make quite sure about this implicit GROUP BY "feature",
>>
>> since it seems really important yet does not seem to be mentioned in the
>> CQL reference documentation.
>>
>> Aaron, you said "yes" -- is that "yes, always, in all scenarios
>> no matter what"
>>
>> or "yes usually"? Is it something we can bet the farm and farmer's
>> family on?
>>
>> The kinds of scenarios where I am wondering if it's possible for
>> partition-key groups
>> to get intermingled are :
>>
>> . what if the node containing primary copy of a row is down
>> and
>> cassandra fetches this row from a replica on a different node
>> (e.g. with CONSISTENCY ONE)
>>
>> . what if there is a heavy stream of UPDATE activity from
>> applications which
>> connect to all nodes, causing different nodes to have different
>> versions of replicas of same row?
>>
>> Can you point me to some place in the cassandra source code where this
>> grouping is ensured?
>>
>> Many thanks,
>>
>> John Lumby
>>
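
The coordinator behaviour described in the reply at the top of this thread (compare the replica results, return them untouched if they match, otherwise read again with extra data and resolve) can be sketched roughly as below. This is a minimal Python illustration, not Cassandra's code. The names `digest`, `resolve` and `coordinator_read` are hypothetical, a cell is modelled as a `(name, value, timestamp)` tuple, ascending `name` order stands in for the on-disk order, and "newest timestamp wins" is an assumed resolution rule that the thread itself does not spell out.

```python
import hashlib
import heapq
from itertools import groupby


def digest(cells):
    """Hash a replica's cells in exactly the order the replica returned them."""
    h = hashlib.md5()
    for cell in cells:
        h.update(repr(cell).encode("utf-8"))
    return h.hexdigest()


def resolve(replica_results):
    """Second pass: combine full results, keeping the newest version of each cell.

    Assumption: every replica list is already in ascending name order, so
    heapq.merge interleaves the pre-ordered streams without re-sorting them.
    """
    merged = heapq.merge(*replica_results, key=lambda cell: cell[0])
    return [
        max(versions, key=lambda cell: cell[2])  # newest timestamp wins (assumed rule)
        for _, versions in groupby(merged, key=lambda cell: cell[0])
    ]


def coordinator_read(replica_results):
    """replica_results: one list of (name, value, timestamp) cells per replica."""
    first = replica_results[0]
    if all(digest(r) == digest(first) for r in replica_results[1:]):
        # Replicas agree: hand back one result as-is, in the order it was read.
        return first
    # Replicas disagree: resolve the differences from the full data.
    return resolve(replica_results)


replica_a = [("a", 1, 10), ("b", 2, 10)]
replica_b = [("a", 1, 10), ("b", 3, 20)]  # holds a newer version of cell "b"

print(coordinator_read([replica_a, replica_a]))
print(coordinator_read([replica_a, replica_b]))
```

In both paths the rows reach the client in the order the replicas produced them. The matching case returns one replica's result unchanged, and the mismatch case only interleaves streams that are already ordered.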