Nadav Har'El created CASSANDRA-20470: ----------------------------------------
Summary: Document SELECT order without ORDER BY Key: CASSANDRA-20470 URL: https://issues.apache.org/jira/browse/CASSANDRA-20470 Project: Apache Cassandra Issue Type: Improvement Components: Documentation Reporter: Nadav Har'El In SQL, doing a "SELECT" without an ORDER BY does not guarantee any order of the results. I've been using Cassandra for many years and until today, I was under the impression that in Cassandra, things are different: I thought that SELECT officially defaults to the "standard" order which sorts partitions according to tokens (or the partition key in case of tie) and inside a partition rows are sorted according to the clustering key(s). I was under the impression that Cassandra users rarely, if ever, use "ORDER BY ASC(c)" in each SELECT query - they just leave it out when wanting the default sort order. I think this is impression is also hard-coded in some of the unit tests. For example, cql3/validation/operations/SelectTest.java checks the result order with various ORDER BY, but its first check actually omits the ORDER BY and verifies that the results come in ascending clustering key order. But today I realized that this isn't documented anywhere I can find! I couldn't find any explicit documentation of what happens in a SELECT *without* an "ORDER BY". Is the order completely unspecified (as in SQL) or can users expect the default sort order is the "standard" one which I mentioned above? My request in this issue is to make the answer to this question clear and explicit in the documentation. Given that Cassandra already goes to all the trouble to guarantee a specific order of the rows on disk (and the on-disk part *is* documented), it seems strange not to make the most efficient way to read it - in that same order - be the default, and document that it is the default. But if this just "seems" to be the default but for some reason is *not* the default, then this fact should be documented, so users won't rely on it. Although to be honest, I'll be surprised if most Cassandra users don't rely on this already - it seems I always did... If this default sort order also can't be relied on, probably the unit tests that assume it will need to be changed, to use assertRowsIgnoringOrder() or something. Interestingly, DynamoDB which whose data model is more-or-less identical to Cassandra's, made the explicit decision that the default sort order always applies to queries: Their clustering key is called a "sort key" and rows are always sorted by it - e.g., [https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html] says "Query results are always sorted by the sort key value.". Beyond the above, I think we also need to document how sort order applies when SELECT uses a secondary index. When the query is restricted by an indexed column, an explicit "ORDER BY" is outright forbidden (the error message is "ORDER BY with 2ndary indexes is not supported, except for ANN queries."), but more importantly it seems that although "usually" the standard default sort order is maintained in such queries, this is not guaranteed if one node returns a long list of results and the coordinator decides not to even try other nodes. I think this fact can cause unpleasant surprises for application developers - results will be sorted as expected when testing on a a small installation, but when moving to a large many-node cluster, suddenly the sort order will no longer be guaranteed. I think this should be documented as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org