Functionally dependent columns in SELECT DISTINCT
Hello!

Postgres lets us omit columns from a GROUP BY clause if they are
functionally dependent on a grouped key, which is a nice
quality-of-life feature. I'm wondering if a similar relaxation could
be permitted for the SELECT DISTINCT list?

I have a query where I want to find the most recent few items from a
table that match some complex condition, where the condition involves
joining other tables. Here's an example, with two approaches:

-- Store some data: an "item" has one or more "parts".
CREATE TABLE items(id int PRIMARY KEY, create_time timestamptz);
CREATE TABLE parts(part_id int PRIMARY KEY, item_id int);
INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => i)
    FROM generate_series(1, 100) s(i);
INSERT INTO parts(item_id, part_id)
    SELECT items.id, 2 * items.id + delta
    FROM items, (VALUES (0), (1)) delta(delta);
CREATE INDEX ON items(create_time DESC);
CREATE INDEX ON parts(item_id);
ANALYZE items, parts;

-- Suppose we want to find the most recent few items with a part
-- whose part ID is threeven. Two approaches:

-- SELECT DISTINCT: fast, but superfluous column:
EXPLAIN ANALYZE
SELECT DISTINCT items.id, create_time
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
-- 4 ms: parallel index scan on items_create_time_idx
--     -> nested loop index scan on parts_item_id_idx
--     -> incremental sort -> gather merge -> unique -> limit

-- GROUP BY: slow, but functional dependency recognized:
EXPLAIN ANALYZE
SELECT items.id
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
GROUP BY items.id
ORDER BY create_time DESC
LIMIT 5;
-- 400 ms: parallel seq scan on parts
--     -> parallel hash join on item_id via seq scan on items
--     -> sort -> group -> gather merge -> group -> sort -> limit

These timings are from Postgres 14.5 on a Linux i7-1165G7. With
Postgres 16.3 on an Apple M3 Pro, the plan shapes are the same: the
GROUP BY query takes about 300 ms, and the SELECT DISTINCT query is
way faster still, at 0.07 ms. (It declines to parallelize, which
seems to help.)

I want to use the faster approach, and it works without issue so far.
But that extra column in the SELECT list is a bit inconvenient. My
questions are:

- Do I understand right that these two queries are equivalent?

- If so, does the SQL standard permit Postgres to recognize
  functional dependency in this case, so that users may omit the
  order column from the `SELECT DISTINCT` list? (I don't have a copy
  of the standard to check myself.)

- Might future versions of Postgres allow this?

Thanks!
~Willow
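P.S. For concreteness, here is the shape I wish were accepted: the
first query above with the superfluous column dropped. Postgres
currently rejects it:

SELECT DISTINCT items.id       -- create_time no longer selected...
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC      -- ...but still sorted on
LIMIT 5;
-- ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list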
Re: Functionally dependent columns in SELECT DISTINCT
On Thu, Sep 12, 2024 at 11:13 PM wrote:
>
> What about using DISTINCT ON () ?
>
> SELECT DISTINCT ON (items.id) items.*
> FROM items
> JOIN parts ON items.id = parts.item_id
> WHERE part_id % 3 = 0
> ORDER BY items.id, items.create_time DESC
> LIMIT 5;
>
> This gives me this plan: https://explain.depesz.com/s/QHr6 on 16.2
> (Windows, i7-1260P)

Ordering by items.id changes the answer, though. In the example I
gave, items.id and items.create_time happened to be in the same
order, but that needn't hold. In reality I really do want the ID
columns of the *most recent* items. You can see the difference if you
build the test dataset a bit differently:

INSERT INTO items(id, create_time)
    SELECT i, now() - make_interval(secs => random() * 1e6)
    FROM generate_series(1, 100) s(i);

We want the returned create_times to all be recent, and the IDs
should now look roughly random.
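P.S. For the record, the DISTINCT ON variant also can't just put the
recency ordering first, since the DISTINCT ON expressions must match
the initial ORDER BY expressions. So even aside from the semantics,
I couldn't simply flip the sort keys:

SELECT DISTINCT ON (items.id) items.*
FROM items
JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY items.create_time DESC, items.id
LIMIT 5;
-- ERROR:  SELECT DISTINCT ON expressions must match initial ORDER BY expressions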
Re: Functionally dependent columns in SELECT DISTINCT
Thanks both for your suggestions so far.

On Fri, Sep 13, 2024 at 8:43 AM David G. Johnston wrote:
>
> On Friday, September 13, 2024, Willow Chargin wrote:
>>
>> In reality I really do want the ID columns of the
>> *most recent* items.
>
> Use a window function to rank them and pull out rank=1

Hmm, like this? (Noting that it's rank <= 5, not rank = 1.)

-- 1. Rank all item-part combinations, densely, since an item may
--    have multiple parts.
-- 2. Limit by rank, still retaining multiple copies of each item.
-- 3. De-duplicate IDs.
SELECT DISTINCT id FROM (
    SELECT id, dense_rank FROM (
        SELECT items.id, dense_rank() OVER (ORDER BY create_time DESC)
        FROM items
        JOIN parts ON items.id = parts.item_id
        WHERE part_id % 3 = 0
    ) q
    WHERE dense_rank <= 5
) q;

I've done this before, but my experience is that it's usually far
slower, because the rank is computed eagerly even for rows that don't
match the rank bound. And indeed here it takes 20% longer than even
the slower GROUP BY from before: https://explain.depesz.com/s/mQIi

> or use a lateral subquery to surgically (fetch first 1) retrieve the
> first row when sorted by recency descending.

I'm not sure that I see how to apply this when I need top-k, not
top-1.
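P.S. In case it helps anyone reading along: the reason for
dense_rank() rather than rank() in step 1 is that an item's multiple
parts tie on its create_time, and rank() leaves gaps after each tie,
so rank <= 5 could return fewer than 5 distinct items. A tiny
throwaway illustration (made-up values, not from the item/part
schema):

SELECT x,
       rank() OVER (ORDER BY x) AS rank,
       dense_rank() OVER (ORDER BY x) AS dense_rank
FROM (VALUES (10), (10), (20)) v(x);
--  x  | rank | dense_rank
-- ----+------+------------
--  10 |    1 |          1
--  10 |    1 |          1
--  20 |    3 |          2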
Re: re-novice coming back to pgsql: porting an SQLite update statement to postgres
On Sun, Sep 15, 2024 at 4:23 AM Alban Hertroys wrote:
>
> > On 15 Sep 2024, at 11:07, Dan Kortschak wrote:
> >
> > I have come to hopefully my last stumbling point.
> >
> > I am unable to see a way to express something like this SQLite
> > syntax
> >
> > select json_group_array(json_replace(value,
> >     '$.a', case
> >         when json_extract(value, '$.a') > 2 then
> >             2
> >         else
> >             json_extract(value, '$.a')
> >         end,
> >     '$.b', case
> >         when json_extract(value, '$.b') < -2 then
> >             -2
> >         else
> >             json_extract(value, '$.b')
> >         end
> > ))
> > from
> >     json_each('[{"a":1, "b":-3},{"a":2, "b":-2},{"a":3, "b":-1}]');
>
> [...]
>
> I see basically two approaches. One is to take the objects apart
> [...]
>
> with t as (
>     select jsonb($$[{"a":1, "b":-3, "c":1},{"a":2, "b":-2, "c":2},
>         {"a":3, "b":-1, "c":3},{"a":3, "b":-3, "c":4}]$$) arr
> )
> select jsonb_agg(jsonb_build_object(
>     'a', case when records.a > 2 then 2 else records.a end
>     , 'b', case when records.b < -2 then -2 else records.b end
>     , 'c', c
> ))
> from t
> cross join lateral jsonb_to_recordset(t.arr) records(a int, b int, c int)
> ;
>
> [...]
>
> The drawback is that you have to specify all fields and types, but
> you don’t need to cast the values all the time either.

Here is a variant of Alban's first method that does not require
specifying all fields and types, and so works with heterogeneous
values:

WITH t AS (
    SELECT jsonb($$[
        {"a": 1, "b": -3, "c": 1},
        {"a": 2, "b": -2, "c": 2},
        {"a": 3, "b": -1, "c": 3},
        {"a": 3, "b": -3, "c": 4}
    ]$$) arr
)
SELECT jsonb_agg(new_element ORDER BY idx) new_arr
FROM t, LATERAL (
    SELECT idx, jsonb_object_agg(key, CASE
            WHEN key = 'a' THEN least(old_value::numeric, 2)::text::jsonb
            WHEN key = 'b' THEN greatest(old_value::numeric, -2)::text::jsonb
            ELSE old_value
        END)
    FROM jsonb_array_elements(arr) WITH ORDINALITY
            old_elements(old_element, idx),
        jsonb_each(old_element) each(key, old_value)
    GROUP BY idx
) new_elements(idx, new_element)

I also took the liberties of using `least` / `greatest` to simplify
the clamping operations, and of using `WITH ORDINALITY` / `ORDER BY`
on the array scan and re-aggregation to make the element ordering
explicit rather than relying on the query engine to not re-order the
rows.

https://www.postgresql.org/docs/16/functions-conditional.html#FUNCTIONS-GREATEST-LEAST
https://www.postgresql.org/docs/16/queries-table-expressions.html#QUERIES-TABLEFUNCTIONS
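P.S. As a quick sanity check of the clamp-and-cast round trip in
isolation (this relies on the direct jsonb-to-numeric cast, which I
believe has been available since Postgres 11):

SELECT least('3'::jsonb::numeric, 2)::text::jsonb      AS a_clamped,
       greatest('-3'::jsonb::numeric, -2)::text::jsonb AS b_clamped;
--  a_clamped | b_clamped
-- -----------+-----------
--  2         | -2

With those clamps in place, I'd expect the full query above to
produce:

[{"a": 1, "b": -2, "c": 1}, {"a": 2, "b": -2, "c": 2},
 {"a": 2, "b": -1, "c": 3}, {"a": 2, "b": -2, "c": 4}]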