Not Patrick, but: - would also love being closer to SQL - there’s no work on this specific grammar, yet - it would depend on a real query optimizer, which IS somewhat in flight (or at least a cost based optimizer was proposed)
> On Apr 7, 2025, at 2:05 PM, Artem Golovko <artemgolovk...@gmail.com> wrote: > > Hi Patrick, > > Really good point, I even did not think about it and actually > completely forgot that ORDER BY with DISTINCT will sort the result > within the group only, but has nothing with ordering of the final > result. I totally agree that aligning CQL with standard SQL behavior > would be a great idea. By the way, are there any open projects or > discussions around this? Or is it still just an internal PoC at this > stage? > > Artem > > пт, 4 апр. 2025 г. в 17:02, Patrick McFadin <pmcfa...@gmail.com>: >> >> I played around with this idea by simulating it in ChatGPT (Yes you can do >> that) It occurred to me that this is similar SQL functionality to the >> DISTINCT keyword. Seeing how we can align CQL with SQL is something I'm >> personally investing more time in for the long-term of the project. This >> could be an opportunity to get one step closer with useful syntax. >> >> Re-arranging your idea in SQL syntax, it would look like this: >> >> SELECT DISTINCT ON (sensor_id) device_id, sensor_id, time, value >> FROM data >> WHERE device_id = 'mydevice' >> AND sensor_id IN ('s1', 's2', 's3') >> ORDER BY sensor_id, time DESC; >> >> I think this is the same outcome and similar partition-level implementation. >> DISTINCT on a multi-partition query would return the first value of each >> partition. This would especially work in these types of primary keys: >> PRIMARY KEY((device_id, sensor_id), time) >> >> In the long term, we don't have more unique syntax building up, which I >> really prefer. >> >> Patrick >> >>> On Tue, Apr 1, 2025 at 9:55 AM Artem Golovko <artemgolovk...@gmail.com> >>> wrote: >>> >>> Hello everyone, >>> >>> I did not find any discussions about that topic and would like to ask >>> if there any considerations to introduce the "PER PARTITION ORDER" >>> functionality. It's a duplication of Scylla question, but now for >>> Cassandra https://forum.scylladb.com/t/per-partition-local-ordering/3412. >>> I am also not so experienced from the cassandra code implementation >>> point of view, but according to my knowledge it should make sense. >>> >>> Let me introduce the use case. >>> >>> Data model: >>> >>> CREATE TABLE data( >>> device_id TEXT, >>> sensor_id TEXT, >>> time TIMESTAMP, >>> value BLOB, >>> PRIMARY KEY((device_id, sensor_id), time) >>> ) >>> >>> Queries: Give me the first and the last value for all sensors within >>> deviceId. >>> >>> Problem: Within the device it's possible to have 10k of sensors or >>> more and if we wanted to get a "snapshot" (e.g. list of sensors with >>> values having the max timestamp) then it may take lots of round trips >>> for small request-response. Therefore we can use the "IN" clause here, >>> grouping keys based on the replica node (e.g. batch node aware read). >>> >>> 1. First point >>> SELECT * FROM data WHERE deviceId = 'mydevice' and sensor_id IN (‘s1’, >>> ‘s2’, ‘s3’) PER PARTITION LIMIT 1 >>> >>> Here we can get the first point for each partition and don’t care >>> about “global” ordering, so the resulting rows won’t be sorted by >>> clustering key and natural order will be applied only locally within >>> each partition. >>> >>> 2. Last point >>> SELECT * FROM data WHERE deviceId = 'mydevice' and sensor_id IN (‘s1’, >>> ‘s2’, ‘s3’) ORDER BY time DESC PER PARTITION LIMIT 1 >>> >>> It’s not possible to use IN and ORDER BY together with paging enabled. >>> The reason is that Cassandra applies order “localy” within each >>> partition, but also applies it “globally” across the resulting rows >>> that makes cassandra store the result in-memory to apply “global” >>> sorting. But if I don’t care about “global” ordering and only want to >>> specify ordering within each partition that introduces performance >>> overhead. >>> >>> What if to introduce "PER PARTITION ORDER" statement? In most of the >>> use cases it should not introduce much benefits, because we're limited >>> to the number of keys in the IN clause (by default 100), so the result >>> should not be big enough to do not fit into the memory, but maybe >>> someone have another use case when PER PARTITION LIMIT more than 1 or >>> payload is big enough.