Re: Per partition local ordering

Patrick McFadin Fri, 04 Apr 2025 09:02:44 -0700

I played around with this idea by simulating it in ChatGPT (Yes you can do
that) It occurred to me that this is similar SQL functionality to the
DISTINCT keyword. Seeing how we can align CQL with SQL is something I'm
personally investing more time in for the long-term of the project. This
could be an opportunity to get one step closer with useful syntax.


Re-arranging your idea in SQL syntax, it would look like this:

SELECT DISTINCT ON (sensor_id) device_id, sensor_id, time, value
FROM data
WHERE device_id = 'mydevice'
  AND sensor_id IN ('s1', 's2', 's3')
ORDER BY sensor_id, time DESC;

I think this is the same outcome and similar partition-level
implementation. DISTINCT on a multi-partition query would return the first
value of each partition. This would especially work in these types of
primary keys: PRIMARY KEY((device_id, sensor_id), time)

In the long term, we don't have more unique syntax building up, which I
really prefer.

Patrick

On Tue, Apr 1, 2025 at 9:55 AM Artem Golovko <[email protected]>
wrote:

> Hello everyone,
>
> I did not find any discussions about that topic and would like to ask
> if there any considerations to introduce the "PER PARTITION ORDER"
> functionality. It's a duplication of Scylla question, but now for
> Cassandra https://forum.scylladb.com/t/per-partition-local-ordering/3412.
> I am also not so experienced from the cassandra code implementation
> point of view, but according to my knowledge it should make sense.
>
> Let me introduce the use case.
>
> Data model:
>
> CREATE TABLE data(
>    device_id TEXT,
>    sensor_id TEXT,
>    time TIMESTAMP,
>    value BLOB,
>    PRIMARY KEY((device_id, sensor_id), time)
> )
>
> Queries: Give me the first and the last value for all sensors within
> deviceId.
>
> Problem: Within the device it's possible to have 10k of sensors or
> more and if we wanted to get a "snapshot" (e.g. list of sensors with
> values having the max timestamp) then it may take lots of round trips
> for small request-response. Therefore we can use the "IN" clause here,
> grouping keys based on the replica node (e.g. batch node aware read).
>
> 1. First point
> SELECT * FROM data WHERE deviceId = 'mydevice' and sensor_id IN (‘s1’,
> ‘s2’, ‘s3’) PER PARTITION LIMIT 1
>
> Here we can get the first point for each partition and don’t care
> about “global” ordering, so the resulting rows won’t be sorted by
> clustering key and natural order will be applied only locally within
> each partition.
>
> 2. Last point
> SELECT * FROM data WHERE deviceId = 'mydevice' and sensor_id IN (‘s1’,
> ‘s2’, ‘s3’) ORDER BY time DESC PER PARTITION LIMIT 1
>
> It’s not possible to use IN and ORDER BY together with paging enabled.
> The reason is that Cassandra applies order “localy” within each
> partition, but also applies it “globally” across the resulting rows
> that makes cassandra store the result in-memory to apply “global”
> sorting. But if I don’t care about “global” ordering and only want to
> specify ordering within each partition that introduces performance
> overhead.
>
> What if to introduce "PER PARTITION ORDER" statement? In most of the
> use cases it should not introduce much benefits, because we're limited
> to the number of keys in the IN clause (by default 100), so the result
> should not be big enough to do not fit into the memory, but maybe
> someone have another use case when PER PARTITION LIMIT more than 1 or
> payload is big enough.
>

Re: Per partition local ordering

Reply via email to