Nadav Har'El created CASSANDRA-20470:
----------------------------------------

             Summary: Document SELECT order without ORDER BY
                 Key: CASSANDRA-20470
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20470
             Project: Apache Cassandra
          Issue Type: Improvement
          Components: Documentation
            Reporter: Nadav Har'El


In SQL, doing a "SELECT" without an ORDER BY does not guarantee any order of 
the results.

I've been using Cassandra for many years and until today, I was under the 
impression that in Cassandra, things are different: I thought that SELECT 
officially defaults to the "standard" order which sorts partitions according to 
tokens (or the partition key in case of tie) and inside a partition rows are 
sorted according to the clustering key(s). I was under the impression that 
Cassandra users rarely, if ever, use "ORDER BY ASC(c)" in each SELECT query - 
they just leave it out when wanting the default sort order.

I think this is impression is also hard-coded in some of the unit tests. For 
example, cql3/validation/operations/SelectTest.java checks the result order 
with various ORDER BY, but its first check actually omits the ORDER BY and 
verifies that the results come in ascending clustering key order. 

But today I realized that this isn't documented anywhere I can find! I couldn't 
find any explicit documentation of what happens in a SELECT *without* an "ORDER 
BY". Is the order completely unspecified (as in SQL) or can users expect the 
default sort order is the "standard" one which I mentioned above? My request in 
this issue is to make the answer to this question clear and explicit in the 
documentation.

Given that Cassandra already goes to all the trouble to guarantee a specific 
order of the rows on disk (and the on-disk part *is* documented), it seems 
strange not to make the most efficient way to read it - in that same order - be 
the default, and document that it is the default. But if this just "seems" to 
be the default but for some reason is *not* the default, then this fact should 
be documented, so users won't rely on it. Although to be honest, I'll be 
surprised if most Cassandra users don't rely on this already - it seems I 
always did... If this default sort order also can't be relied on, probably the 
unit tests that assume it will need to be changed, to use 
assertRowsIgnoringOrder() or something.

Interestingly, DynamoDB which whose data model is more-or-less identical to 
Cassandra's, made the explicit decision that the default sort order always 
applies to queries: Their clustering key is called a "sort key" and rows are 
always sorted by it - e.g., 
[https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html] 
says "Query results are always sorted by the sort key value.".

Beyond the above, I think we also need to document how sort order applies when 
SELECT uses a secondary index. When the query is restricted by an indexed 
column, an explicit "ORDER BY" is outright forbidden (the error message is 
"ORDER BY with 2ndary indexes is not supported, except for ANN queries."), but 
more importantly it seems that although "usually" the standard default sort 
order is maintained in such queries, this is not guaranteed if one node returns 
a long list of results and the coordinator decides not to even try other nodes. 
I think this fact can cause unpleasant surprises for application developers - 
results will be sorted as expected when testing on a a small installation, but 
when moving to a large many-node cluster, suddenly the sort order will no 
longer be guaranteed. I think this should be documented as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to